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Abstract 



Cross-validation (CV) is a popular method for model-selection. Unfortunately, it is 
not immediately obvious how to apply CV to unsupervised or exploratory contexts. 
This thesis discusses some extensions of cross-validation to unsupervised learning, 
specifically focusing on the problem of choosing how many principal components to 
keep. We introduce the latent factor model, define an objective criterion, and show 
how CV can be used to estimate the intrinsic dimensionality of a data set. Through 
both simulation and theory, we demonstrate that cross-validation is a valuable tool 
for unsupervised learning. 
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Chapter 1 
Introduction 



Cross-validation (CV) is a popular method for model selection. It works by estimat- 
ing the prediction error of each model under consideration and then choosing the 
model with the best performance. In unsupervised contexts, though, there is no clear 
notion as to what exactly "prediction error" is. Therefore, it is difficult to employ 
cross-validation for model selection in unsupervised or exploratory contexts. In this 
thesis, we take a look at some extensions of cross-validation to unsupervised learning. 
We focus specifically on the problem of choosing how many components to keep for 
principal component analysis (PCA), but many of the concepts we introduce are more 
broadly applicable. 

Before we can do anything, we need a solid theoretical foundation. To this end, 
Chapter 2 gives a survey of relevant results from multivariate statistics and ran- 
dom matrix theory. Then, Chapter 3 derives the behavior of the singular value 
decomposition (SVD) for "signal-plus-noise" matrix models. These two chapters are 
complemented by Appendices A and B, which collect properties of random orthog- 
onal matrices and give some limit theorems for weighted sums of random variables. 
Collectively, this work provides the groundwork for the rest of the thesis. 

In Chapter 4, we introduce the latent factor model. This is a generative model 
for signal-plus-noise matrix data that expands the setup of PCA to include correlated 
factor loadings. We motivate loss functions for estimating the signal part, and then 
show how the SVD performs with respect to these criteria. 
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CHAPTER 1. INTRODUCTION 



The next chapter (Chapter 5) focuses on cross-validation strategies. It covers both 
Wold-style "speckled" hold-outs as well as Gabriel-style "blocked" hold-outs. We 
define model error and prediction error for the latent factor model, and present the 
two cross-validation methods as estimators of prediction error. The chapter includes 
a comparison of CV methods with parametric model-selection procedures, showing 
through simulation that CV is much more robust to violations in model assumptions. 
In situations where parametric assumptions are unreasonable, cross-validation proves 
to be an attractive method for model selection. 

Chapter 6, the final chapter, contains a theoretical analysis of Gabriel-style cross- 
validation for the SVD, also known as bi-cross- validation (BCV). This chapter shows 
that BCV is in general a biased estimator of model error, with an explicit expression 
for the bias. Despite this bias, though, the procedure can still be used successfully for 
model selection, provided the leave-out sizes are chosen appropriately. The chapter 
shows how to choose the leave-out sizes and proves a weak form of consistency. 

Cross-validation is a valuable and flexible procedure for model selection. Through 
theory and simulation, this thesis demonstrates the applicability and utility of cross- 
validation as applied to principal component analysis. Many of the ideas in the 
following chapters generalize to other unsupervised learning procedures. This thesis 
shows that cross-validation can successfully be used for model selection in a variety 
of contexts. 



Chapter 2 



Multivariate statistics background 



Multivariate statistics will prove to be a central tool for this thesis. We use this 
chapter to gather the relevant definitions and results, concentrating mainly on the 
eigenvalues and eigenvectors from sample covariance matrices. The literature in this 
area spans over fifty years. We give the basic definitions and properties of the mul- 
tivariate normal and Wishart distributions in Section 2.1. Then, in Section 2.2, we 
survey classical results about sample covariance matrices when the number of di- 
mensions, p, is fixed and the sample size, n, grows to infinity. This material is by 
now standard and can be found in any good multivariate statistics book (e.g. Muir- 
head [61]). Lastly, in Section 2.3, we survey modern asymptotics, where n — > oo and 
p grows with n. Modern multivariate asymptotics is still an active research topic, 
but today it is possible to give a reasonably-complete description of the objects of 
interest. 



2.1 The multivariate normal and Wishart distri- 
butions 

We start with the definition of the multivariate normal distribution and some basic 
properties, which can be found, for example, in Muirhead [61] [Chapters 1-3]. 

Definition 2.1 (Multivariate Normal Distribution). For mean vector p, £ MP and 
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positive-semidefinite covariance matrix S e IR pxp , a random vector X G W is said 
to be distributed from the multivariate normal distribution, denoted X ~ M (//, S) i/ 
/or every /ixerf vector a E W , the vector a T X has a univariate normal distribution 
with mean a T /i and variance a T Sa. 

The multivariate normal distribution is defined for any positive-semidefinite covari- 
ance matrix S, but it only has a density when £ is strictly positive-definite. 

Proposition 2.2. J/I e l p follows a multivariate normal distribution with mean \i 
and positive- definite covariance matrix S ; then its components have density 

f(x) = (2n)-^\i:\- l/2 exp (-±( X - B ) T V-\x - rf) . (2.1) 

A basic fact about the multivariate normal is the following: 

Proposition 2.3. Let aeW, C e R qxp , and X ~ M (/x, S) . Define Y = CX + a. 
Then Y ~ M (C/x + a, CSC T ) . 

Two immediate corollaries are: 

Corollary 2.4. Suppose that X ~ A/"(0, o- 2 I p ) and that O G W xp is an orthogonal 
matrix. Then OX = X- 

Corollary 2.5. If X ~ N (0, E) and £ = CC T for a matrix C E W xp , and if 
Z~JV(0, I p ), then CZ = X. 

We are often interested in estimating the underlying parameters from multivariate 
normal data. The sufficient statistics are the standard estimates. 

Proposition 2.6. Say that Xi,X 2 , ■ ■ ■ ,X n o,re independent draws from a Af (/i, S) 
distribution. Then the sample mean 
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and the sample covariance 

1 n 

are sufficient statistics for fi and X . 

To describe the distribution of S n , we need to introduce the Wishart distribution. 

Definition 2.7 (Wishart Distribution). Let Xi,X 2 , . . . ,X n e W be an iid sequence 
of random vectors, each distributed as Af (0, X) . Then the matrix 

n 

A = J2X,Xl 

i=i 

is said to have the Wishart distribution with n degrees of freedom and scale parameter 
X. We denote this by A ~ W p (n, X) . 

When n > p and X is positive-definite, the elements of a Wishart matrix have a 
density. 

Proposition 2.8. Suppose that A ~ W p (n, X). If n > p and X is positive- definite, 
then the elements of A have a density over the space of positive-definite matrices, 
given by 

/(A) = 2f| S |!r p (j) exp ('2 tr ( s "^)- (2 ' 4) 

where T p (•) zs £/ie multivariate gamma function, computed as 

r p (5)=^nr(^i). (2.5) 

i=l ^ ' 

We can now characterize the distributions of the sufficient statistics of a sequence 
of iid multivariate normal random vectors. 

Proposition 2.9. Let X n and S n be defined as in Proposition 2.6. Then X n and S n 
are independent with X n ~ A/" (/i, ^X) and (n — l)S n ~ W p (n — 1, X). 
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White Wishart matrices — those with scale parameter S = o 2 I p are of particular 
interest. We can characterize their distribution in terms of eigenvalues and eigenvec- 
tors. 

Proposition 2.10. Suppose that A ~ W p (n, o~ 2 I p ) with n > p and let A = nOLO T 

be the spectral decomposition of A, with L = diag Z 2 , • • • , l p ) and li > l 2 > ■ ■ ■ > 
l p > 0. Then O and L are independent, with O Haar- distributed over the group of 
p x p orthogonal matrices and the elements of L having density 

/ i \ np/2 p 2/ 2 V V ,. 

wf™ni'.-ynr- i,/ v^. ^ 



\^ 2 J Mf)r p (i) 



i<j i=l 



In the random matrix theory literature, with a 2 = 1 the eigenvalue density above is 
sometimes referred to as the Laguerre Orthogonal Ensemble (LOE). 



2.2 Classical asymptotics 

In this section we present results about sample covariance matrices when the sample 
size, n, tends to infinity, with the number of dimensions, p a fixed constant. A 
straightforward application of the strong law of large numbers gives us the limits of 
the sample mean and covariance. 

Proposition 2.11. Let X±, X 2 , ■ ■ ■ , X n be a sequence of iid random vectors in M. p 
with E [Xi] = /i and E (X x — fij (Xi — fj) T = S. Then, as n — > oo ; 

1 " 

X n = — y Xi ^ n 



n 
i=i 



and 



1 n - - T as 

S n = (Xi — Xn) (Xi — Xn) — * S. 

n i 



To simplify matters, for the rest of the section we will mostly work in a setting 
when the variables have been centered. In this case, the sample covariance matrix 
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takes the form S n = ^ Y^i=i XiXl- To see that centering the variables does not 
change the theory in any essential way, we provide the following proposition. 

Proposition 2.12. Let Xi,X 2 , . . . ,X n be a sequence of random observations in W 3 
with mean vector /i and covariance matrix S. Let X n = - J^ILi an< ^ ^ n = 
~ =4 SiLi {-X-i ~ Xn) (Xi — Xn) T be the sample mean and covariance, respectively. De- 
fine the centered variables Xi = Xi — /f Then 

n jri W 

In particular, this implies that 

(S n - S) = £ MI -xJ+o P . 

Proof. We can write 

n 

Sn=^~rJ2 X& + —A*n- I*) {Xn ~ /i) ^ 

n — 1 n — 1 v ~ y v ~ y 

i=i 

= ( - + -r^-nr) E MI + —Ax n - (x„ - /,) T 

7i(7i — 1) J n — l x -' v - y 

The result follows since Ml = O p (1) and X n - y, = O p (j^ ) . 

□ 

The next fact follows directly from the multivariate central limit theorem. 

Proposition 2.13. Suppose that Xi, X 2 , . . . , X n is a sequence of iid random vectors 
in W with 

E [XiXl] = S, 

and that for all 1 < i,j,i',f < p there exists finite 
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If S n — ^ Y17=i Xi^I i then y/nvec (S n — S) —> vec (G) , where G is a random p x p 
symmetric matrix with vec (G) a mean-zero multivariate normal having covariance 
Cov (Gij, Gi'j/) = ^iji'j'- 

In the previous proposition, vec is an operator that stacks the columns of an n x p 
matrix to create an np-dimensional vector. 

If the elements of Xi have vanishing first and third moments (for instance if the 
distribution of Xi is symmetric about the origin, i.e. Xi = —Xi), and if E [XiX^] = 
diag (Ai, A2, • • • , A p ) , then T^ji simplifies to 

F iUi = E [X 4 U ] - \ 2 t for 1 < i < p, (2.7a) 

Fiji, = Fini = E [XlXl 3 ] for 1 < i, j < p, i± j; (2.7b) 

all other values of r^j/ are 0. In particular, this implies that the elements of vec (G) 
are independent. If we also have that Xi is multivariate normal, then 

Fun = 2Ai for 1 < % < p, and (2.8a) 

Fijij = Vijji = XiXj for 1 < i, j < p, i ^ j. (2.8b) 

It is inconvenient to study the properties of the sample covariance matrix when the 
population covariance £ is not diagonal. By factorizing £ = $A$ T for orthogonal 
$ and diagonal A, we can introduce Xi = <& T Xi to get E XjA/J = A and S n = 

<fr X^=i Xi%7^ ^* T - With this transformation, we can characterize the distribution 
of S n completely in terms of - YH=\ 

The next result we present is about the sample principal components. It is moti- 
vated by Proposition 2.13 and is originally due to Anderson [1]. 

Theorem 2.14. For n — > 00 and p fixed, let S±, S 2 , ■ ■ ■ , S n be a sequence of ran- 
dom symmetric p x p matrices with ^/nvec(S n — A) —> vec (G) , for a determin- 
istic A = diag (Ai, A2, . . • , A p ) having Ai > A2 > • • • > A p and a random symmet- 
ric matrix G. Let S n = U n L n U^ be the eigendecomposition of S n , with L n = 
diag (Z n ,i, l n , 2 , • • • , ln, P ) and l nA > l n>2 > • • •> l n , P - IfG = P (1), and the signs ofU n 
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are chosen so that U nyii > for 1 < i < p, then the elements of U n and L n converge 
jointly as 

(U n>ii -1)^0 for 1 < i < p, (2.9a) 

d G 

VnUn^j — > — - — ^— for 1 < i, j < p with % ^ j , and (2.9b) 
A, — Aj 

(l n ,i - \) ^ Gu forl<i<p. (2.9c) 



More generally, Anderson treats the case when the \ are not all unique. The key 
ingredient to Anderson's proof is a perturbation lemma, which we state and prove 
below. 



Lemma 2.15. For n — > oo and fixed p let S±, S2, ■ ■ • , S n G W xp be a sequence of 
symmetric matrices of the form 

1 ( 1 

S n = A + // +0[-= 



where A = diag (Ai, A 2 , . . . , X p ) with Ai > A 2 > • • • > X p and H n = 0(1). Let 
S n = U n L n U"^ be the eigendecomposition of S n , with L n = diag (l n ,i, l n ,2, • • • , ln,p) ■ 
Further suppose that U nj u > for 1 < i < p and l Ht i > > ■ • ■ > l n ,p- Then for all 
1 < h j < P and i ^ j we have 

U n , u = l + o (^-] , (2.10a) 



n f 
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Proof. Define p x p matrices E n , F n , and A n so that 



E n = diag (U n>11 , £4,22, • • • , U n , pp ) , (2.11) 
F n = Vn(U n -E n ), (2.12) 
A n = Vn(L n -A), (2.13) 



giving 



U n = E n + -^F n , and 

£ n = A + ^A n . 

sjn 



We have that 



1 ( 1 

5 n = A + // + o [-= 

— U n L n Un 

= E n AEl + ±= {E n A n El + F n AEl + E n AF^) + -M n (2.14) 



where the elements of M"„ are sums of O (p) terms, with each term a product of 
elements taken from E n , F n , A, and A n . Also, 



I P = u n u" n 

= E n El + -1= + E n Fl) + iw n , (2.15) 

where VF n = F n F^. From (2.15) we see that for 1 < i, j < p and i ^ j we must have 

\ = E 2 nii +-W nM , and (2.16a) 
n 

— E n ^iF n ji + F n jjE n jj 4~ —Wn t ij. (2.16b) 
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Substituting E„ u = 1 — ^W n;ii into equation (2.14), we get 

H nt u = E nt iiA n> iiE nt u + -J= (M n< u - XiW n>ii ) + o (1) , and (2.17a) 



#n,ij — '\jE„.jjl-'„.;j + \F niji E njii H — —M nji j + (1) . (2.17b) 

v ^ 

Equations (2.16a)-(2.17b) admit the solution 

E n , u = l + o (^=) , (2.18a) 



F n , tJ = --^- + o(l), and (2.18b) 

Aj — Aj 

A n)ii = + o (1) . (2.18c) 

This completes the proof. □ 

An application of the results of this section is the following theorem, which de- 
scribes the behavior of principal component analysis for large n and fixed p. 

Theorem 2.16. Let X±,X 2 , . . . ,X n be a sequence of iid J\f (//, S) random vectors 
in W, with sample mean X n and sample covariance S n . Let £ = $A$ T be the 
eigendecomposition of S, with A = diag (Ai, A 2 , . . . , A p ) and X 1 > A 2 > • • • > \ p > 0. 
Similarly, let S n = U n L n Un be the eigendecomposition of S n , likewise with L n = 
diag (/„,!, l nj2 , • • • , l n , P ), k,i > L,2 > ••• > l n , P , and signs chosen so that ($ T t/„) .. > 
for 1 < i < p. Then 

( i) L,i \ fori <i< p, and U n ^4' $ . 

(ii) After appropriate cenering and scaling, {l n ,i}^=i an d U n converge jointly in 
distribution and their limits are independent. For all 1 < i < p 

vM^-A^A/^O, 2A, 2 ), 

and 

(<!> T U n - I p ) 4 F, 
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where F is a skew-symmetric matrix with elements above the diagonal indepen- 
dent of each other and distributed as 

Fij ~ Af I 0, XAj 2 ) , foralll<i<j< p. 
\ (A - Aj-J y 

Proof. Part (i) is a restatement of Proposition 2.11. Part (ii) follows from Proposi- 
tion 2.13 and Theorem 2.15. □ 



2.3 Modern asymptotics 

We now present some results about sample covariance matrices when both the sample 
size, n, and the dimensionality, p, go to infinity. Specifically, most of these results 
suppose that n — > oo, p — > oo, and | — > 7 for a fixed constant 7 G (0, 00) . There 
is no widely-accepted name for 7, but we will sometimes adopt the terminology of 
Marcenko and Pastur [56] and call it the concentration. 

Most of the random matrix theory literature concerning sample covariance ma- 
trices is focused on eigenvalues. Given a sequence of sample covariance matrices 
Si, S2, ■ ■ ■ , S n , with S n G M. pxp and p = p(n) these results generally come in one of 
two forms. If we label the eigenvalues of S n as l n ,i,ln,2, ■ ■ ■ Jn,p, with l n> i > l n ^ > 
• • • > l n , P , then we can define a random measure 

i=l 

This measure represents a random draw from the set of eigenvalues of S n that puts 
equal weight on each eigenvalue. It is called the spectral measure of S n . Results 
about F Sn are generally called results about the "bulk" of the spectrum. 

The second major class of results is concerned with the behavior of the extreme 
eigenvalues l n> i and l n>p . Results of this type are called "edge" results. 
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2.3.1 The bulk of the spectrum 

To work in a setting where the dimensionality p, grows with the sample size, n, we 
introduce a triangular array of sample vectors. The sample covariance matrix S n is 
of dimension p x p and is formed from row n of a triangular array of independent 
random vectors, X n ^ X n>2 , X n>n . Specifically, S n = \ YTi=\ X n,i^ly We let X n 
be the n x p matrix X n = (x nl X n ^ ■ ■ ■ X n ^ , so that S n = ^X^X n . Most 
asymptotic results about sample covariance matrices are expressed in terms of X n 
rather than S n . For example, the next theorem about the spectral measure of a large 
covariance matrix is stated this way. 

Theorem 2.17. Let Xi, X 2 , ■ ■ ■ , X n be a sequence of random matrices of increasing 
dimension as n — > oo ; so that X n e ]R nxp and p = p(n). Define S n = ^X^X n . If 
the elements of X n are iid with K\X nt u — EX n) n| 2 = 1 and ^ — > 7 > 0, then the 
empirical spectral measure F Sn almost surely converges in distribution to a determin- 
istic probability measure. This measure, denoted F^ p , is called the Marcenko-Pastur 
Law. For 7 > 1 it has density 

^ MP(X) = 2^V ^ X '~ a ~<^ h ~< ~ X ^ a i< x < K (2-20) 

where a 7 = ^1 — -j^j and 6 7 = ^1 + -^j ■ When 7 < 1, there is an additional 
point-mass 0/(1 — 7) at the origin. 

Figure 2.1 shows the density f^ p (x) for different values of 7. The reason for choosing 
the name "concentration" to refer to 7 becomes apparent in that for larger values of 
7, F^ p becomes more and more concentrated around its mean. 

The limiting behavior of the empirical spectral measure of a sample covariance 
matrix was originally studied by Marcenko and Pastur [56] in 1967. Since then, 
several papers have refined these results, including Grenander and Silverstein [37], 
Wachter [92], Jonsson [47], Yin and Krishnaiah [98], Yin [96], Silverstein and Bai [82], 
and Silverstein [81]. These papers either proceed via a combinatorial argument in- 
volving the moments of the matrix elements, or else they employ a tool called the 
Stieltjes transform. Theorem 2.17 is a simplified version of Silverstein and Bai's main 
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Figure 2.1: The Marcenko-Pastur law. Density, f™ F (x), plotted against quan- 
tile, x, for concentration 7 = 0.25, 1, and 4. Concentration is equal to the number of 
samples per dimension. For 7 < 1, there is an addition point-mass of size (1 — 7) at 
x = 0. 



result, which more generally considers complex- valued random variables and allows 
the columns of X n to have heterogeneous variances. 

The meaning of the phrase u F Sn converges almost surely in distribution to i^ p " 
is that for all x which are continuity points of , 

F s n ( x) i*- f mp (x) (2 21) 

Equivalently, Theorem 2.17 can be stated as a strong law of large numbers. 



Corollary 2.18 (Wishart LLN). Let X n and {l n ,iYi=i be as m Theorem 2.17. Let 
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g : R — > R fre any continuous bounded function. Then 

-E^i)"" / sto^to- (2-22) 
^ i=i ^ 

Concerning convergence rates for the quantities in Theorem 2.17, Bai et al. [4] 
study the total variation distance between F Sn and F^ p . Under suitable conditions 
on X n>11 and 7, they show that ||F Sn - F^ P || TV = O p (n _2/5 ) and that with proba- 
bility one \\F S " - F 7 MP || TV = O (n~ 2 / 5+e ) for any e > 0. Guionnet and Zeitouni [39] 
give concentration of measure results. If X n> n satisfies a Poincare inequality and g is 
Lipschitz, they show that for S large enough, 

logP I J g(x) dF s -(x) - J g{x) dF™ p (x) > 5 j = O {5 2 ) , (2.23) 

with an explicit bound on the error. If one is willing to assume that the elements 
of X n are Gaussian, then Hiai and Petz [41] give an exact value for the quantity in 
(2.23). Guionnet [38] gives a survey of other large deviations results. 

It is interesting to look at the scaled behavior in equation (2.22) when the quan- 
tities are scaled by p. Indeed one can prove a Central Limit Theorem (CLT) for 
functionals of eigenvalues. 

Theorem 2.19 (Wishart CLT). Let Xi, X 2 , ■ ■ ■ , X n be a sequence of random n x p 
matrices with p = p{n). Assume that X n has iid elements and that E [X n ll ] = 0, 
E [X 2 n ] = I, and E [X^ n ] < 00. Define S n = ^X^X n and let F Sn be its spectral 
measure. If n — > oo ; ^ — > 7 and gi,g 2 , • •■ ,gk ar & real-valued functions analytic on 
the support of F^ p , then the sequence of random vectors 

p-(Jgi(x) dF s -(x) - J 9l (x) dF MP (x), 

J g 2 (x) dF s "(x) - J g 2 (x) dF MP (x), 

J g k (x) dF s -{x) - J g k (x) dF MP {x)) 
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is tight. Moreover, if E n] = 3 ; then the sequence converges in distribution to a 
multivariate normal with mean fi and covariance X, where 

* = ^<h)+9i(b,) _ l_ f hl g t (x) ^ 

4 2 7Tj a7 yj(x - a 7 )(6 7 - x) 



and 



E - = ~ff 9M 9j te) d d z 

2tt 2 J J ( m (z 1 )-m(z 2 )) dz! dz 2 



The integrals in (2.25) are contour integrals enclosing the support of Fj^ p , and 



m(z) = -(s + l-T-Wts-a.Xs-Zg (226) 
wit/i £/ie square root defined to have positive imaginary part when Qz > 0. 



The case when E [X* n ] = 3 is of particular interest because it arises when X njll ~ 
AT (0,1). 

This theorem was proved by Bai and Silverstein [6], and can be considered a 
generalization of the work by Johnsson [47]. In computing the variance integral (2.25), 
it is useful to know that m satisfies the identities 

m(z) = m(z), 

and 

1 7' 1 
m(z) m(z) + 1 

Bai and Silverstein show how to compute the limiting means and variances for g{x) = 
log a; and g{x) = x r . They also derive a similar CLT when the elements of X n are 
correlated. Pastur and Lytova [66] have recently relaxed some of the assumptions 
made by Bai and Silverstein. 
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2.3.2 The edges of the spectrum 

We now turn our attention to the extreme eigenvalues of a sample covariance matrix. 
It seems plausible that if F Sn F^ p , then the extreme eigenvalues of S n should 
converge to the edges of the support of F^ p . Indeed, under suitable assumptions, 
this is exactly what happens. For the largest eigenvalue, work on this problem started 
with Geman [34], and his assumptions were further weakened by Jonsson [48] and Sil- 
verstein [76]. Yin et al. [97] prove the result under the weakest possible conditions [7]. 

Theorem 2.20. Let X ly X 2 , ■ ■ ■ , X n be a sequence of random matrices of increasing 
dimension, with X n of size n x p, p = p{n), n — > oo ; and ^ — > 7 e (0, 00). Let 
S n = ^Xj t X n and denote its eigenvalues by l rh i > Z„ i2 > • • • > l n , P - If the elements 
of X n are iid with EX ni n = ; EX^ n = 1, and KX^ U < 00, then 

/„,i- (l + 7- 1/2 ) 2 - 

For the smallest eigenvalue, the first work was by Silverstein [78], who gives a result 
when X n> u ~ Af (0, 1). Bai and Yin [10] proved a theorem that mirrors Theorem 2.20. 

Theorem 2.21. Let X n , n, p, and {l n ,i}i=i be as in Theorem 2.20. IfEX* u < 00 
and 7 > 1, then 

in,- (1 - r 1 ' 2 ) 2 ■ 

With the same moment assumption on X nt u, ifO < 7 < 1, then 

l n ,p-n+l (l - 7~ 1/2 ) 2 • 

For the case when the elements of X n are correlated, Bai and Silverstein [5] give a 
general result that subsumes Theorems 2.20 and 2.21. 

After appropriate centering and scaling, the largest eigenvalue of a white Wishart 
matrix converges weakly to a random variable with known distribution. Johans- 
son [44] proved this statement and identified the limiting distribution for complex 
white Wishart matrices. Johnstone [45] later provided an analogous result for real 
matrices, which we state below. 
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Theorem 2.22. Let Xi,X 2 , ■ ■ ■ ,X n be a sequence of random matrices of increasing 
dimension, with X n e IR nxp 7 p = p(n), and n — > oo with - — * 7 e (0, 00). Define 
S n = ^Xj t X n and label its eigenvalues > Z n>2 > • • • > l n , P - If the elements of X n 
are iid with X n> u ~ A/"(0, 1) , then 

where 

Hn, P = \ 1/2 + Vp- 1/2) , 

/ \ 1/3 

and zs £/ie Tracy- Widom law of order 1. 

El Karoui [28] extended this result to apply when 7 = or 7 = 00. With appropriate 
modifications to ji n>p and <7„ )P , he later gave a convergence rate of order (n A p) 2 / 3 
for complex- valued data [29]. Ma [53] gave the analogous result for real- valued data. 
For correlated complex normals, El Karoui [30] derived a more general version of 
Theorem 2.22. 

The Tracy- Widom distribution, which appears in Theorem 2.22, was established 
to be the limiting distribution (after appropriate scaling) of the maximum eigenvalue 
from an n x n symmetric matrix with independent entries distributed as Af (0, 2) 
along the main diagonal and W(0, 1) otherwise [89] [90]. To describte F X TW , let q(x) 
solve the Painleve II equation 

q"(x) = xq(x) + 2q 3 (x), 

with boundary condition q(x) ~ Ai(x) as x — > 00 and Ai(x) the Airy function. Then 
it follows that 

F™(x) = exp |-1 J™ q(x) + (x - s)q\x) dx} . 
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Figure 2.2: The Tracy- Widom law. Limiting density of the largest eigenvalue 
from a white Wishart matrix after appropriate centering and scaling. 



Hastings and McLeod [40] study the tail behavior of q(x). Using their analysis, one 
can show (see, e.g. [70]) that for s — > — oo, 

i? W (*)~e Xp 

while for s — > oo, 

The density of F 1 TW is shown in Figure 2.2. 

A result like Theorem 2.22 holds true for the smallest eigenvalue. We define the 
Reflected Tracy-Widom Law to have distribution function Gj w (s) = 1 — F 1 TW (— s). 
Then we have 
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Theorem 2.23. With the same assumptions as in Theorem 2.22, if '7 G (l,oo) then 

W 1 ~ 



where 



*Vp = l (y/n-\/2-y/p-\/2) , 
= \ (y/n-l/2-y/p-l/2* 



1/3 



V^T72 - 1/2 



#7 G (0, 1), £/ien 



ln,p-n+l _^ yy ^ qTW 



a p,n 



We get the result for 7 G (0, 1) by reversing the role of n and p. Baker et al [13] proved 
the result for complex data. Paul [67] extended the result to real data when 7 — > 00. 
Ma [53] gives convergence rates. In practice, log/„ iP converges in distribution faster 
than l n>p . Ma recommends appropriate centering and scaling constants for log/ njP to 
converge in distribution to a G TW random variable at rate (n Ap) 2 ^ 3 . 

Theorem 2.23 does not apply when ^ — > 1. Edelman [26] derived the limiting 
distribution of the smallest eigenvalue when n = p. It is not known if his result holds 
more generally when - — > 1. 



Theorem 2.24. Let X n and {i n ,i}f=i be as in Theorem 2.22. If p(n) = n, then for 
t > 0, 

F{nl rhP <t}^ [ 
Jo 



i±^ e -(*/2+>/i) dx. 



In addition to the extreme eigenvalues, it is possible to study the joint distribution 
of top or bottom k sample eigenvalues for fixed k as n — > 00. In light of Theorem 2.17, 
for fixed k we must have that the top (respectively, bottom) sample eigenvalues 
converge almost surely to the same limit. Furthermore, Soshnikov [83] showed that 
applying the centering and scaling from Theorem 2.22 to the top k sample eigenvalues 
gives a specific limiting distribution. 
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It is natural to ask if the limiting eigenvalue distributions are specific to Wishart 
matrices, or if they apply to non-Gaussian data as well. There is compelling evidence 
that the Tracy-Widom law is universal. Soshnikov [83] extended Theorem 2.22 to 
more general X n under the assumption that X n ^\ is sub-Gaussian and n — p = 
0(p 1 ^ 3 ). Peche [69] later removed the restriction on n — p. Tao and Vu [87] showed 
that Theorem 2.24 applies for general X n:11 with EX Ujll = and EX^ U = 1. 

2.3.3 Eigenvectors 

Relatively little attention has been focused on the eigenvectors of sample covariance 
matrices. While many results are known, as of yet there is no complete character- 
ization of the eigenvectors from a general sample covariance matrix. Most of the 
difficulty in tackling the problem is that it is hard to describe convergence properties 
of U n , the p x p matrix of eigenvectors, when n and p go to infinity. The individual 
p 2 elements of U n do not converge in any meaningful way, so the challenge is to come 
up with relevant macroscopic characteristics of U n . 

Silverstein [74] was perhaps the first to study the eigenvectors of large-dimensional 
sample covariance matrices. He hypothesized that for sample covariance matrices of 
increasing dimension, the eigenvector matrix becomes more and more "Haar-like" . A 
random matrix U G W xp is said to be Haar-distributed over the orthogonal group 
if for every fixed p x p orthogonal matrix O, the rotated matrix OU has the same 
distribution as U. That is, OU = U. Silverstein's conjecture was that asn^ oo, U n 
behaves more and more like a Haar-distributed matrix. The next theorem displays 
one aspect of Haar-like behavior. 

To state the theorem, we need to define the extension of a scalar function g:Mi-> 
R to symmetric matrix arguments. If S = ULU T is the eigendecomposition of the 
symmetric matrix S G M. pxp , with L = diag(7i, / 2 , . . . , then we define 



g(S) = u(di ag (g(l 1 ),g(l 2 ),. 



With this notion, we can state Silverstein's result. 
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Theorem 2.25. Let Xi,X 2 , ■ ■ ■ ,X n be a sequence of random matrices of increasing 
dimension, with X n G IR nxp 7 p = p(n), and X n having iid elements with EX nj n = 0, 
EX^ n = 1, and RX* n < oo. Define S n = ^Xj t X n . Let ai,a 2 , . . . ,a n be a sequence 
of nonrandom unit vectors with a n in W and let g : R — > R be a continuous bounded 
function. If n — > oo and ^ — > 7 e (0, 00), £/ien 

alg{S n )a n a -5 J g{x)dF™ p {x). 

Silverstein [74] proves the result for convergence in probability and a specific class of 
X n . Bai et al. [3] strengthen the result to a larger class of X n and proves almost-sure 
convergence. They also consider dependence in X„. 

It may not be immediately obvious how Theorem 2.25 is related to eigenvectors. 
If S n = U n L n U n is the eigendecomposition of S n , with L n = diag (l n ,i,ln,2, ■ ■ ■ , ln,p) , 
then g(L n ) = diag (g(l n ,i),g(l n ,2), ■ ■ -,9(L, P )) and g(S n ) = U n g(L n )l/^. We let b n = 
U n a n . Then, 

p 

a T n g(S n )a n = J2bl t g(ln,). (2.27) 

i=i 

If U n is Haar-distributed, then b n will be distributed uniformly over the unit sphere in 
MP, and the average in (2.27) will put about weight - on each eigenvalue. If U n puts 
bias in any particular direction then the average will put extra weight on particular 
eigenvalues. 

Silverstein investigated second-order behavior of eigenvectors in [75], [77], [79], 
and [80]. He demonstrated that certain second-order behavior of U n depends in a 
crucial way on the fourth moment of X n:11 . This greatly restricts the class of X n for 
with the eigenvectors of S n are Haar-like. 



Theorem 2.26. Let X n , S n , and a n be as in Theorem 2.25. Suppose also that 
EX^ n = 3. Let gi, g 2 , . . . ,gu be real-valued functions analytic on the support of F^ p . 
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Then, the random vector 

VP ■ (al 9l (S n )a n - J 9l {x)dF^ p {x), 

alg2{S„)a n - / g 2 (x) dF^ p (x), 



a^g k (S n )a n - / g k (x)dF^ p (x) 



converges in distribution to a mean-zero multivariate normal with covariance between 
the ith and jth components equal to 

gi (x) 9j (x) dF^ p {x) - J 9i (x) dF^ p {x) ■ J 9j (x) dF^ p {x). 

Bai et al. [3] give a similar result for complex-valued and correlated X n . Silver- 
stein [79] showed that if 9 i(x) = x and g 2 {x) = x 2 , then the condition EX^ n — 3 is 
necessary for the random vector in Theorem 2.26 to converge in distribution for all 
a n . However, for the specific choice of a n = (^p, ■ ■ ■ ■> ? he later showed that 
the conclusions of Theorem 2.26 hold more generally when X n ll is symmetric and 
EX^ U < oo [80]. 
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Chapter 3 



Behavior of the SVD in low-rank 
plus noise models 

3.1 Introduction 

Many modern data sets involve simultaneous measurements of a large number of 
variables. Some financial applications, such as portfolio selection, involve looking 
at the market prices of hundreds or thousands of stocks and their evolution over 
time [57]. Microarray studies involve measuring the expression levels of thousands 
of genes simultaneously [73]. Text processing involves counting the appearances of 
thousands of words on thousands of documents [55]. In all of these applications, it is 
natural to suppose that even though the data are high-dimensional, their dynamics 
are driven by a relatively small number of latent factors. 

Under the hypothesis that one ore more latent factors explain the behavior of a 
data set, principal component analysis (PCA) [46] is a popular method for estimating 
these latent factors. When the dimensionality of the data is small relative to the sam- 
ple size, Anderson's 1963 paper [1] gives a complete treatment of how the procedure 
behaves. Unfortunately, his results do not apply when the sample size is comparable 
to the dimensionality. 

A further complication with many modern data sets is that it is no longer appro- 
priate to assume the observations are iid. Also, sometimes it is difficult to distinguish 
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between "observation" and "variable". We call such data "transposable" . A microar- 
ray study involving measurements of p genes for n different patients can be considered 
transposable: we can either treat each gene as a measurement of the patient, or we can 
treat each patient as a measurement of the gene. There are correlations both between 
the genes and between the patients, so in fact both interpretations are relevant [27]. 

One can study latent factor models in a transpose-agnostic way by considering 
generative models of the form X = UDV T + E. Here, X is the nxp observed data 
matrix. The unobserved row and column factors are given by U and V, respectively, 
matrices with k orthonormal columns each, and k <C min(n,p). The strengths of 
the factors are given in the k x k diagonal matrix D, and E is a matrix of noise. A 
natural estimator for the latent factor term UDV can be constructed by truncating 
the singular value decomposition (SVD) [36] of X. The goal of this paper is to study 
the behavior of the SVD when n and p both tend to infinity, with their ratio tending 
to a nonzero constant. 

In an upcoming paper, Onatski [64] gives a thorough treatment of latent factor 
models. He assumes that the elements of E are iid Gaussians, that y / n(?7 T t/ — /&) 
tends to a multivariate Gaussian distribution, and that V and D are both non- 
random. The contributions of this chapter are twofold. First, we work under a 
transpose-agnostic generative model that allows randomness in all three of U, D, 
and V. Second, we give a more complete picture of the almost-sure limits of the 
sample singular vectors, taking into account the signs of the dot products between 
the population and sample vectors. 

We describe the main results in Section 3.2. Sections 3.3-3.8 are dedicated to 
proving the two main theorems. Finaly, we discuss related work and extensions in 
Section 3.9. We owe a substantial debt to Onatski's work. Although most of the 
details below are different, the general outline and the main points of the argument 
are the same. 
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3.2 Assumptions, notation, and main results 

Here we make explicit what the model and assumptions are, and we present our main 
results. 

3.2.1 Assumptions and notation 

We will work sequences of matrices indexed by n, with 

X n = y^U n D n Vl + E n . (3.1) 

We denote by i/n(7 n D n y^ the "signal" part of the matrix X n and E n the "noise" 
part. We will often refer to U n and V n as the left and right factors of X n , and the 
matrix D n will be called the matrix of normalized factor strengths. The first two 
assumptions concern the signal part: 

Assumption 3.1. The factors U n and V n are random matrices of dimensions nx k 
and p x k, respectively, normalized so that U^U n = V^V n = I k . The number of 
factors, k, is a fixed constant. The aspect ratio satisfies ^ = 7 + f or a fixed 

nonzero constant 7 e (0, 00). 

Assumption 3.2. The matrix of factor strengths, D n , is of size k x k and diagonal, 
with D n = diag (d nA , d n , 2 , d njk ) and d njl > d H)2 > ■■■ > d n>k > 0. The matrix D n 
converges almost surely to a deterministic matrix D = diag(//^ 2 , /i^ 2 , • • • il^ 2 ) w ^ 
A*i > Ai 2 > ■ ■ ■ > fik > 0. Moreover, the vector ^(d^ - /ii, d 2 n l - /i 2 , . . . , d 2 n K - fi k ) 
converges in distribution to a mean-zero multivariate normal with covariance matrix 
S having entries Sjj = (possibly degenerate). 

The next assumption concerns the noise part: 

Assumption 3.3. The noise matrix E n is an n x p matrix with entries E n>i j inde- 
pendent Af(0, a 2 ) random variables, also independent ofU n , D n , and V n . 

For analyzing the SVD of X n , we need to introduce some more notation. We 

denote the columns of U n and V n by u n ^, u nt2 , • • • , M n ,fc and v n ,\iHn,ii ■ ■ ■ 1 V n ,k, respec- 
~ - - T 

tively. We let y / nU n D n V n be the singular value decomposition of X n truncated 
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to k terms, where D n = diag(/t^, Hn% ■ ■ ■ , H^lt) and the columns of U and V are 
given by u n ,i, Un,2 • • • , Un,k and v n>1 , £ n , 2 , • • • , £„,*,, respectively. 

3.2.2 Main results 

We are now in a position to say what the results are. There are two main theorems, one 
concerning the sample singular values and the other concerning the sample singular 
vectors. First we give the result about the singular values. 

Theorem 3.4. Under Assumptions 3.1 - 3.3, the vector pn = (p n ,i, ft>n,2, ■ ■ ■ , An,fc) 
converges almost surely to p — (pi, p2, ■ ■ ■ , pk), where 

_ J(/^ + - 2 )(l + ^) wkeniM>£, 

Hi — \ , x 2 l-J-^J 

ya 2 ( 1 + ^=J otherwise. 

Moveover, y/n(p — p) converges in distribution to a (possibly degenerate) multivariate 
normal with covariance matrix £ whose ij element is given by 



l 3 I 2 I I 2 . 



+ ^2 ( T 2 (2^ + (l + 7 - i ) ( r 2 )(l- 2 



*V = ^ + fi_._.9.n2 [ 9 .u, + h + ^ n*\ll- — \ ( 3 - 3 ) 



^ 4 w/ien n h Hj > 
otherwise. 

When an = 2/i 2 7 and Hi > the variance of the ith component simplifies to an = 

Next, we give the result for the singular vectors: 

Theorem 3.5. Suppose Assumptions 3.1 - 3.3 hold. Then the k x k matrix ® n = 
V^V n converges almost surely to a matrix = diag(#i, 9 2 , . . . , Ok), where 

„ ffl--4) + — V 1 whenHi>^, 
otherwise. 



3.2. ASSUMPTIONS, NOTATION, AND MAIN RESULTS 



29 



Also, the k x k matrix 
diag(</?i,</? 2 ,.. .,<Pk), where 

Moreover, Q i and (ft almost surely have the same sign. 
3.2.3 Notes 

Some discussion of the assumptions and the results are in order: 

1. Assumptions 3.1 and 3.2 are simpler than the assumptions given in many other 
papers while still being quite general. For example, Paul's "spiked" covariance 
model has data of the form 

X = ZH T + E, 

where S is an p x k matrix of factors and Z is an nx k matrix of factor loadings 
whose rows are iid multivariate jV(0, C) random variables for covariance matrx 

rp A 1/2 T 

C having eigen-decomposition C = QAQ . Letting Z = y/nPA Q be the 
SVD of Z, Anderson's results [1] give us that A and Q converge almost surely to 
A and Q, respectively, and that the diagonal elements of ^/n(A — A) converge to 
a mean-zero multivariate normal whenever C has no repeated eigenvalues. If we 
define U = P, D = A 1/2 , and V = HQ, then X = y/nUDV T + E, where the 
factors satisfy Assumptions 3.1 and 3.2. Dropping the normality assumption on 
the rows of Z poses no problem. Moreover, we can suppose instead of iid that 
the rows of Z are a martingale difference array with well-behaved low-order 
moments and still perform a transformation of the variables to get factors of 
the form we need for Theorems 3.4 and 3.5. 

2. There is a sign-indeterminancy in the sample and population singular vectors. 
We choose them arbitrarily. 



U^U n converges almost surely to a matrix $ = 



otherwise. 
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3. If instead of almost-sure convergence, D n converges in probability to D, then 
the theorems still hold with fi n , ©„, and <fr n converging in probability. 

4. The assumption that y/n {d\ — /ii, d\ — fi2, ■ ■ - d\ — fik) converges weakly is only 
necessary for determining second-order behavior; the first order results still 
hold without this assumption. If the limiting distribution of the vector of factor 
strengths y/n(dl — fj,i, d\ — yU 2 , . . . d\ — jik) is non-normal, one can still get at 
the second-order behavior of the SVD through a small adaptation of the proof. 

5. Most of the results in Theorems 3.4 and 3.5 can be gotten from Onatski's results 
[64]. However, Onatski does not show that y/n(fii — Jii) — > when /Zj is below 
the critical threshold. Furthermore, Onatski proves convergence in probability, 
not almost sure convergence. Lastly, Onatski does not get at the joint behavior 
between © n and $ n . 



3.3 Preliminaries 

Without loss of generality we will assume that a 2 = 1. Until Section 3.8 , we will also 
assume that 7 > 1. 



3.3.1 Change of basis 

A convenient choice of basis will make it easier to study the SVD of X n . Define 
U Ht i = U n , and choose U nj2 so that (lJ n ,i U n ^ is an orthogonal matrix. Similarly, 

put V n> i = V n and choose V n> 2 so that (v nj i V n ^ is orthogonal. If we define 
E n ,ij = Ul :i E n V nd and X n>ij = U"J i X„V raJ , then in block form, 




(v„,i vj) 



VnD n + E 

E n ,21 
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Because Gaussian white noise is orthogonally invariant, the matrices E n>i j are all 
independent with iid Af(0, 1) elements. Let 



£n,22 = (O n ,i O n>2 ) ^ 



(3.6) 



be the SVD of E n>22 , with A n = diag (A n> i, A„ j2 , . . . A njP _ fc ) . Note that E n22 E n)22 ~ 
Wp_ fc (n - k,Ip- k ) . Define 



//, o \ 
oj 



12 



^n,21 E rh22 



Ik o 
P n , 



(3.7) 



where E n>n — E n> u, E n ^ 2 — E ni i 2 P n , E ni21 — O^E^i, and 2£ n ,3i — 0^ 2 E n:3 i. 
Let U n D n V n be the SVD of X n , truncated to k terms. Lastly, put the left and right 
singular vectors in block form as 



U n>1 

fin* 



(3.8) 



and 



V n 



V n ,i 

Vn, 



(3.9) 



where r/ n ,i and V^i both x matrices. 

We have gotten to I n via an orthogonal change of basis applied to X n . By 
carefully choosing this basis, we have assured that: 

1. The blocks of X n are all independent. 



2. The elements of the matrices E n ^ are iid Af (0, 1). 
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3. The elements {n\ n> i, nA„ )2 , • . • ,nA„ iP _fc} are eigenvalues from a white Wishart 
matrix with n — k degrees of freedom. 

4. X n and X n have the same singular values. This implies that D n = D n . 

5. The left singular vectors of X n can be recovered from the left singular vectors 
of X n via multiplication by an orthogonal matrix. The same holds for the right 
singular vectors. 

6. The dot product matrix U n U n is equal to U n ,i- Similarly, V n V n = Vn,i- 
This simplified form of the problem makes it much easier to analyze the SVD of X n . 

3.4 The secular equation 

~ T ~ 

We set S n = \ = X n X n . The eigenvalues and eigenvectors of S n are the squares 
of the singular values of ~^X n and its right singular vectors, respectively. In block 
form, we have 

S n = I 5 "- ; ' Sn ' 12 I (3.10) 



where 



^71,21 Sn,22. 



S n ,u — D 2 n H — — [D n E ntU + Ej hll D n ) 

\ Tl 



+ ~ { E n,U E n,ll + El 2l E n ,21 + ^ 31 £! n ,3l) , 

/ V 



(3.11a) 



and 



S n , 12 = ^= (D n E n , 12 + El 21 K]l 2 ) + ^El M E n ^ (3.11b) 

S n , 21 = 4= (KuDn + K /2 E n ,2i) + -El 12 E nA1 , (3.11c) 

S n ,22 = A n H — E^ 12 E n>12 . (3. lid) 

Th 
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Now we study the eigendecomposition of S n . If v = ( 1 ) is an eigenvector of S n 
with eigenvalue /x, then 

S n ,u S n; i 2 \ I vA _ (vi 
S n}2 i S n ^ 2 ) \V 2 ) \V2 

If /x is not an eigenvalue of S nt2 2, then we get 

V2 = - (S n ,22 - /i/p-fc)" 1 S n ,2i v 1 , and (3.12a) 
(S„,n - /iJ fe - 5 n ,i2 (5„ >2 2 - nlp-k)' 1 S n ,2i) v\ = 0. (3.12b) 

Conversely, if (/x,t>i) is a pair that solves (3.12b) and v\ ^ 0, then 

V=[ Vl ) (3.13) 

is an eigenvector of S n with eigenvalue /x. 



We define 

T n {z) — S ni u — zlk — £71,12 (S n ,22 — zlp-k) 1 £n,2i (3-14) 
/ n (z,z) =T n (z)z, (3.15) 

and refer to f n (z, x) = as the secular equation. This terminology comes from numer- 
ical linear algebra, where a secular equation is analogous to a characteristic equation; 
it is an equation whose roots are eigenvalues of a matrix. Typically, secular equations 
arise in eigenvalue perturbation problems. The name comes from the fact that they 
originally appeared studying secular perturbations of planetary orbits. A more stan- 
dard use of the term "secular equation" would involve the equation detT n (z) = 0. 
However, for our purposes it is more convenient to work with f n . 
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3.4.1 Outline of the proof 

Almost surely S n and S n> 22 have no eigenvalues in common, so every eigenvalue- 
eigenvector pair of S n is a solution to the secular equation. To study these solutions, 
we first focus on T n (z). 

It turns out that when z > (1 + 7 -1 / 2 ) 2 , we can find a perturbation expansion for 
T n (z). In Section 3.5, we show that for z above this threshold, we can expand 

T n (z)=T (z) + ^=T nA (z), 



where T (z) is deterministic and T n ^{z) converges in distribution to a matrix- valued 
Gaussian process indexed by z. With this expansion, in Section 3.6 we study se- 
quences of solutions to the equation f n {z n ,x n ) = 0. Using a Taylor series expansion 
for z n > (1 + 7 -1 / 2 ) 2 , we write 

1 ( 1 

Z n = Zq + —^=Zn t i + P [—= 



1 ( 1 

x n = x + —j=x n ^ + Op —= 

in \ \ n 



where (z°,x°) is the limit of (z n ,x n ) as n — > oo and (z Uj i, is the order-^ ap- 
proximation error. 

In Section 3.7 we get the singular values and singular vectors of -^X n . From 
every solution pair (z n ,x n ) satisfying f n (z n ,x n ) = 0, we can construct an eigenvalue 
and an eigenvector of S n using equation (3.13). The eigenvalues are squares of sin- 
gular values of -i=X m and the eigenvectors are right singular vectors. We can get 
the corresponding left singular vectors by multiplying the right singular vectors by 
X n and scaling appropriately. For z values above the critical threshold, we use the 
perturbation results of the previous two sections. Below the threshold, we use a more 
direct approach involving the fluctuations of the top eigenvalues of A n . 

Finally, in Section 3.8 we show that the results still hold when 7 < 1. For parts 
of the proof, we will need some limit theorems for weighted sums from Appendix B. 
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3.5 Analysis of the secular equation 

We devote this section to finding a simplified formula for T n (z) for certain values of 
z. By a bit of algebra and analysis, we find the first- and second-order behavior of 
the secular equation. 

First, we employ the Sherman-Morrison- Woodbury formula [36] to get an expres- 
sion for (S rh 22 — zl p -k)~ l ■ As a reminder, this identity states that for matrices A, 
B, and C with compatible dimensions, the following formula for the inverse holds: 

(A + BCY 1 = A' 1 - A l B (I + CA^B)' 1 CA \ 

Using this, we can write 

= (A n — zlp-k) 

(A n — zl p -k) 

' E n,12 I Ik + ~ En > 12 (-^-« ~ z Ip-k) E n,12 j E n,l2 

• (A n — zlp-k) 1 • 



(3.16) 



' n,21 



Next, we define D n = D n + A=E n n, so that 

^n,12 ^ra,22 ~ Zlp^f^j S 

= \ (D T n E n , 12 + El 2l XT) 

■ {S n ,22 — Zl p ^k) 

■ (El 12 D n + \]{ 2 E n 



(3.17) 



There are three important terms coming out of equations (3.16) and (3.17) that 
involve A n . These are E n>12 (A n - zl p _ fe ) _1 E^ 12 , E^ 21 (A n - zl v _ k y x E Hj21 , and 
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E n ,i2 (A„ — zl p _ k ) 1 Al/ 2 E ni 2i. Each term can be written as a weighted sum of outer 
products. 

There is dependence between the weights, but the outer products are iid. For 
example, with E n>12 = (s n ,i2,i £n,i2,2 • • • £„,i 2> p-k) , we have 

p-k 

E n ,12 (A n - Zlp- k ) 1 £^ 12 , = ^ T — • £'n,12,a £n,l2,a- 

From the Central Limit Theorem and the Strong Law of Large Numbers we know 

that ^ Er=5 ^n,12,a #n,12,a ^ J k alld that ~ ^ Ea=5 E n,\2,oc £% >12>a ~ Ik) 

converges in distribution to mean-zero symmetric matrix whose elements are jointly 
multivariate normal. In the limiting distribution, the elements have variance 2 along 
the diagonal and variance 1 off of it; aside from the matrix being symmetric, the 
unique elements are all uncorrelated. The difficulty in analyzing these terms comes 
from the dependence between the weights. 

When z is in the support of F^ p , the weights behave erratically, but other- 
wise they have some nice properties. First of all, A n is independent of E n ^ 2 and 
E n>2 i. Secondly, the Wishart LLN (Corollary 2.18) and Theorem 2.20 ensure that 
for z > (1 +7" 1/2 ) 2 , ^EriiA^ ^' lt3^ F 7 MP W- Moreover, the Wishart 
CLT (Theorem 2.19) guarantees that the error is of size Op(-). These properties in 
combination with the limit theorems for weighted sums in Appendix B allow us to 
get the behavior of E nj i2 (A n — zl^y x E^ 12 and its cousins. 



The function 



m{z) = J ^r/F^it) 



1 + 7- 1 ) + ^J{z- 6 7 ) (z - a 7 ) 



= 7 g (3.18) 



is the Stieltjes transform of F^ , where a 7 = (l — 7 x / 2 ) 2 and 5 7 = (l +7 1//2 ) z . 
When restricted to the complement of the support of F'jf 9 , m has a well-defined 
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inverse 

z(m) = -- + - 1 —— . (3.19) 

m l + 7- 1 m 

Also, m is strictly increasing and convex outside the support of F^ p . This function 
appears frequently in the remainder of the chapter. 

Lemma 3.6. If z > & 7 , then 

X -E n ^ (A n - zI^Y 1 El l2 a -4' 7 " 1 m(z) • I k , (3.20a) 
-El 21 (A„ - zVfe)" 1 £ n , 21 a -4' 7 - x m(^) • I k , (3.20b) 



and 

-£ n>12 (A n - zlp-k)- 1 K /2 E n ,2i ^ 0. (3.20c) 

77/ 

Proof. We prove the result for the first quantity and the other derivations are analo- 
gous. For each 1 < i, j < k, we have that 

(I i t \ p-k 1 ^ CgrMgL (^,12)^ 

U B ' 12 ( ^ " y " ~ ' ^ ^ • 

Let N — p — k, define weights W NtCl = (X^a-z)' 1 , and let Y N:OC = (E nA2 ) ia (E nA2 ) ja . 

9(t) = 



The function 

, — otherwise, 



is bounded and continuous. Moreover, since A„ 5 i ^4' 6 7 , with probability one g(\ n , a ) 
is eventually equal to Wjv, Q f° r a ^ a - The Wishart LLN (Corollary 2.18) gives us 
that h El=i W NiCt a -5 J ^- z dF^ p (t) = m(z). Since |Wjv iQ | < W NA <' 6 7 , the fourth 
moments of the weights are uniformly bounded in N. The Fat jQ , are all iid with EY/v,» = 
5ij and EY^ Q < 00. Applying these results, the weighed SLLN (Proposition B.2) gives 
us that X^a=i ^N,cXN,a m{z)5ij. Since £^ — > 7 _1 , this completes the proof. □ 

Lemma 3.7. Considered as functions of z, the quantities in Lemma 3.6 and their 
derivatives converge uniformly over any closed interval [u,v] C (6 7 ,oo). 
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Proof. We show this for the first quantity and the other proofs are similar. Defining 
e n ,ij(z) = (-E nA2 (A n - zlp-k) 1 E^i 2 j , 

\ n / ij 

we will show that for all with 1 < i, j < k, sup 2g ^] \e n ^j(z) — 'j~ 1 m(z)5 i j\ ^4 0. 

Let e > be arbitrary. Note that for z > 6 7 , m'(z) > and m"(z) < 0. We let 
d = m'(u) and choose a grid u = z 1 < z 2 < ■ ■ ■ < zm-i < z m = v, with \z\ — z\ + \\ = Ik. 
Then \ r y~ 1 m(zi) — r y~ 1 m(zi + i)\ < |. From Lemma 3.6, we can find N large enough 
such that for n > N, maxj e {i v .. j M} \e n ,ij(zi) — l~ lm ( z i)^ij\ < f ■ Also guarantee that iV 
is large enough so that A„ ; i < it (this is possible since A n ,i ^4 6 7 < u). Let z G [ii, i>] 
be arbitrary and find / such that z\ < z < z t+1 . Observe that e n>i j(z) is monotone 
for z > ^,,1. Thus, either e njij -(2,) < e n>i j(z) < e njij (z l+1 ) or e n4j (z t+1 ) < e n ^(z) < 

If i ^ j, we have for n > JV, — | < e n;i j(z) < |. Otherwise, when i = j, e n>i j(z) 
is monotonically increasing and 7 -1 m(^) — | < e n ^j(z) < 7~ 1 m(^ + i) + |, so that 
7 _1 m(2;) — £ < e„ 5 ij(z) < j~ 1 m(z) + £. In either case, \e n jj(z) — r y~ 1 m(z)5ij\ < e. 
Since ^ = ~ {\-z) 2 ' w hich is monotone for z > A, the same argument applies 

to show that the derivatives converge uniformly. □ 



Lemma 3.8. If z±, z 2 , ■ ■ ■ , z\ > 6 7 , then jointly for z e {z±, z 2 , ■ ■ ■ , z{\ 

Vn Q^n,i2 (An - zIp-kT 1 < 12 - 7" 1 ™ Wfc) = F n (z) ^ F(z), (3.21a) 

^^,12 (An " zip-*)"* K /2 En,2?j = G n (z) ^ G(z), (3.21b) 

Q^n,2i (An - zl^y 1 E n , 21 - 7 - 1 m(z)I j ^ = H n {z) A H(z), (3.21c) 
where the elements of F(z), G(z), and H(z) jointly define a multivariate Gaussian 
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process indexed by z, with each matrix independent of the others. The nonzero co- 
variances are defined by 



Cov (Fifa), Fifoj) = 7" 1 (1 + <y 
Cov (Gijiz!), G ij (z 2 )'j = 7 



m(z 1 ) — m(z 2 ) 



z l — z 2 

1 z\m{z\) - z 2 m(z 2 ) 



cov (h^), Hatei) = i- 1 (i + <y 



zi - z 2 

m(zi) — m(z 2 ) 



Zl - z 2 



with the interpretation when z\ = z 2 that 



(3.22a) 
(3.22b) 
(3.22c) 



m(z 1 )-m(z 2 ) ,, , , z 1 m(z 1 ) - z 2 m(z 2 ) , . , 
— m(zi), and = m(z 1 ) + z\ m \z\). 



z i — z 2 



z i — z 2 



Proof. We will need the Wishart CLT (Theorem 2.19) and the strong weighted mul- 
tivariate CLT (Corollary B.5). To save space we only give the argument for the joint 
distribution of F(zi) and F(z 2 ). 

Put N = p — k and consider the 2/c 2 -dimensional vector Yjv a = { ~„ N ' a j , where 
Yn ,q — vgc (£'n,i2,a En,i2,a) an d E n \ 2 — \ Fi n \ 2 \ E n \ 2 2 ■ ■ ■ £'n,i2,Ar ) • Define the 
2A; 2 -dimensional weight vector Wn,o — I iY ' Q ' 1 J , where WN, a ,t — A \ z . l- We have 

\WN,a,2 J 

that for a — 1, 2, . . . , N, the Yn,o. are iid with 



E [Y N ,i\ = v Y =' ' 



and p, = vec (Ik)- Also, we have 



E 



(Yn,i - V ¥ ) (Yn,i - V ¥ ) 



= S Y = 



where £ 



E 



VeC (£n,12,l£n,12,l - J fc )J ( vec (^n.12,1^,12,1 ~ 1 



is a /c 2 x /c 2 
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matrix defined by the relation 



E 



(i?n,12,li?n,12,l ~ 1 k) ^ (^,12,1^,12,1 ~ Ik )i'j' 



(ij)=(i'j') + S (i,j)=U',i')- 



That is, the diagonal elements of E n> i2 t i£f£ 12 i — Jfc have variance 2 and the off- 
diagonal elements have variance 1. Aside from the matrix being symmetric, the 
unique elements are all uncorrelated. 



Letting 



Hi = ■ = rn{Zi), and 



t — Z. 



m(zi) — m(zj) 



(t-Zi)(t-Zj) Zi-Z 



3 



the Wishart LLN combined with the truncation argument of Lemma 3.6 gives us that 
J_ r W l 11- „W anf ] J_ V w I a A a w Thus 



Moreover, the Wishart CLT tells us that the error in each of the sums is of size 



As in Lemma 3.6, the fourth moments of Wn,o and Yn,ol are all well-behaved. 
Finally, we can invoke the strong weighted CLT (Corollary B.5) to get that the 
weighted sum v^/V Y^l=i Wn,o • Yn,o — • /-t T ) converges in distribution to a 
mean- zero multivariate normal with covariance 
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This completes the proof since 



.1 - n 



p — 

(with = v7D, and - 7~ 1/2 - □ 

Remark 3.9. It is possible to show that the sequences in Lemma 3.8 are tight by 
an argument similar to the one used in [64]- This implies that the convergence is 
uniform in z. For our purposes, we only need that the finite- dimensional distributions 
converge. 



With Lemmas 3.6-3.8, we can get a perturbation expansion of T n (z) for z > h 



Lemma 3.10. If [u,v\ C (fe 7 , oo) ; then T n (z) ^5 T (z) uniformly on [u,v], where 



To(z) = - - _\ , , D 2 + -^—I k . (3.23) 
1 + 7 1 m(z) m(z) 



Lemma 3.11. If z > b~, then 



T n (z) = T (z) + -^T n)1 (z), (3.24) 



where 



T n , l (z)=-(l+ 1 - 1 m(z)) ■ DF n (z)D 
+ (l + 7 - 1 m(z)) _1 

■{^(D 2 n - D 2 ) + D(E niU - G n (z)) + (E n , n - G n (z)) T d} 

- z H n {z) + Q< 3 i^n,3i - (1 - 7^)^ + op (1) . (3.25) 
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Proof. First, we have 



S n , n =D 2 n + I k + ^= {DE nA1 + El u D) 



Next, we compute 



-1 



= (l k + l- l m{z)I k + -^=F n {z)^j 

= (1 + 1 - 1 m(z)y 1 I k - 4= (1 + 7 _1 m(^))" 2 F n (z) + op (-^) 

v n V v n / 

Using this, after a substantial amount of algebra we get 



= zm (z )I k + ' r D 
1 + 7 i m(z) 



+ A=\ ~, V^t (DG n (z) + Gl(z)D) 

y/n 1 1 + -f- 1 m(z) v w nV 7 / 



+ 



+ l -El 2l E n ^ 1 + oJ^= 



The equations for T and T n l follow. To simplify the form of T , we use the identity 

^(l+T-^^-^+a-T- 1 )- □ 
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3.6 Solutions to the secular equation 

We will now study the solutions to f n (z, x) = 0, denned in equation (3.15). If (z, x) is 
a solution, then so is (z,ax) for any scalar a. We restrict our attention to solutions 
with ||x|| 2 = 1. We also impose a restriction on the sign of x, namely we require that 
the component with the largest magnitude is positive, i.e. 

maxxj = max \xi\. (3.26) 

i i 

3.6.1 Almost sure limits 

First we look at the solutions of f (z,x) = T (z)x, the limit of f n (z,x) for z > 6 7 . 

Lemma 3.12. Letting k = max{i : ^ > r )~ 1 ^ 2 }, if fii, fx 2 , ■ ■ ■ , are all distinct, then 
there are exactly k solutions to the equation fo(z,x) = 0. They are given by 

[P>i, Si), i — 1, • • • , k, 

where pi is the unique solution 

m(pi) 

and Qi is the ith standard basis vector. 



1 



Proof. We have that 

T (z) = , D 2 + ^r-Jk- 

1 + 7 1 m{z) m{z) 

Since this is diagonal, the equation f (z,x) = holds iff the ith diagonal element of 
Tq(z) is zero and x — e^. The ith diagonal is zero when 
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equivalently 

m(z) = -. 

Note that m(z) > — (7 -1 ^ 2 + 7" 1 ) 1 and m'(z) > on (fe 7 ,oo). Hence, a unique 
solution exists exactly when ^ > □ 

Given a solution of fo(z,x) = 0, it is not hard to believe that there is a sequence 
of solutions (z n ,x n ) such that f n (z n ,x n ) = 0, with z n and x n converging to z and x, 
respectively. We dedicate the rest of this section to making this statement precise. 

Lemma 3.13. If fi > 7~ 1 / 2 occurs I times on the diagonal of D 2 , then with probability 
one there exist sequences z n l , . . . , z n j such that for n large enough: 

1- z n ,i ^ z n ,j fori ^ j 

2. detT n (z n>i ) = fori = 1,...,/. 

3. z n>i -> z = m' 1 (^j^r) ■ 

The proof involves a technical lemma, which we state and prove now. 

Lemma 3.14. Let g n (z) be a sequence of continuous real-valued functions that con- 
verge uniformly on (u,v) to go(z). If go(z) is analytic on (u,v), then for n large 
enough, g n {z) and go(z) have the same number of zeros in (u,v) (counting multiplic- 
ity). 

Proof. Since IgoWl is bounded away from zero outside the neighborhoods of its zeros, 
we can assume without loss of generality that go(z) has a single zero z e (u,v) of 
multiplicity I. Define g n (z) = u^ffi-i ■ The function go(z) is bounded and continuous, 
and has a simple zero at z . For r small enough, go(z + r) and g (z Q — r) have differing 
signs. Without loss of generality, say that the first is positive. 

The sequence g n {z) converges uniformly to go(z) outside of a neighborhood of z . 
Thus, for n large enough, g n (z — r) < and g n [z + r) > 0. Also, either g n (z) has a 
zero at z , or else for a small enough neighborhood around Zq, sgn g n (z) is constant. 
Since g n (z) is continuous outside of a neighborhood of zq, g n {z) and g n (z) must have 
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a zero in (z — r, z + r) C (u,v). Call this zero z Ujl . Since r is arbitrary, z n> i — * z , 
and g " ( - 2 - > >■ uniformly on (u, v). We can now proceed inductively, since 

Z — Z n ,l Z—ZO V ' / 2 — 20 

has a zero of multiplicity Z — 1 at zq. □ 

Proof of Lemma 3.13. Let be the unique solution of m(zo) = ([/, + 7" 1 ) -1 . Define 
9n(z) — detT n (z). Since det is a continuous function, g n (z) converges uniformly to 
go(z) in any neighborhood of z . Noting that go(z) has a zero of multiplicity I at z , 
by Lemma 3.14 we get that for large enough n, g n (z) has / zeros in a neighborhood 
of z . By a lemma of Okamoto [63], the zeros of g n (z) are almost surely simple. □ 

The last thing we need to do is show that for each sequence z n ^ solving the 
equation detT n (z nti ) = 0, there is a corresponding sequence of vectors x n ^ with 
fn(z n ,i, $n,i) = 0. Since detT n (z nji ) = 0, there exists an x n>i with T n (z n ^x nti = 0. 
We need to show that the sequence of vectors has a limit. 

Every solution pair [z n ^ x nt i) determines a unique eigenvalue-eigenvector pair 
through equation (3.13). Since the eigenvalues of S n are almost surely unique, 
with the identifiability restriction of (3.26) we must have that z n>i uniquely deter- 
mines x Uj i. Suppose that z n ^ is the ith largest solution of detT n (z) = 0, and that 
fn(z n ,i,Zn,i) = 0. We will now show that x n ^ ^4 e^. 

Lemma 3.15. Suppose that f n (z n ,i,Xn,i) — 0, that x n ^ satisfies the identifiability 
restriction (3.26), and that z n>i fa. If fa ^ fij for all j ^ i, then x n ^ ^4 gj. 

We will use a perturbation lemma, which follows from the sinO theorem (see 
Stewart and Sun [85] [p. 258]). 

Lemma 3.16. Let (z,x) be an approximate eigenpair of the k x k matrix A (in the 
sense that Ax m zx), with \\x\\ 2 = 1. Let r = Ax — zx. Suppose that there is a set C 
of k — 1 eigenvalues of A such that 

5 = min \l — z\> 0. 

lec 

Then there is an eigenpair (z ,x ) of A with H^olh = 1 satisfying 
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and 

\z - Zq\ < \\r\\ 2 . 

Proof of Lemma 3.15. We have that z nji —> \±i and that T n (z nyi )x nyi = 0. Since 
T n (zn ti ) ^ T (p,i), we get 

||T'o(A*i)2 ; n,i||2 — || {T n (z n j — T (/ii)) X n> i || 2 

< ||T n (^-T (^)|| F 

no. 

Since /!« is distinct, is a simple eigenvalue of T (jj,i). Thus, all other eigenvalues 
have magnitude at least 8 > for some 5. Define r n = T (//i)x nj j. By Lemma 3.16, 
there exists an eigenpair (2o,3?o) of T (j2i) with |z | < ||r n ||2 and x^Xq > \j 1 — . 
Since ||r n || n 0, for n large enough we must have ^ = and = Si or — e«. Lastly, 
noting that x and x nj j are both unit vectors, we get 

£n,i 2-0 1| 2 = 2 2x n j Xo 

no. 

With the identifiability restriction, this forces x n ,i Si- □ 

Finally, we show that eventually the points described in Lemma 3.13 are the only 
zeros of f n (z,x) having z > b T Since f n (z,x) = implies detT n (^) = 0, it suffices 
to show that T n (z) has no other zeros. 

Lemma 3.17. For n large enough, almost surely the equation detT n (z) = has 
exactly k solutions in (6 7 ,oo) (namely, the k points described in Lemma 3.13). 

Proof. By Lemma 3.14, for n large enough detT n (z) and detT (z) have the same 
number of solutions in (u, v) C (6 7 , oo) . Thus, we only need to show that the solutions 
of det T n (z) are bounded. Since every solution is an eigenvalue of S n , this amounts to 
showing that the eigenvalues of S n are bounded. Using the Courant-Fischer min-max 
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characterization of eigenvalues [36], we have 



1 

1 

. n ' 

- , 2 

a.s. 



< ( \\U n D n Vl\\ 2 + ^\\E nll , 



Thus, the solutions of det T n (z) = are almost surely bounded. □ 



3.6.2 Second-order behavior 

To find the second-order behavior of the solutions to the secular equation, we use 
a Taylor-series expansion of f n (z,x) around the limit points. That is, if (z n ,x n ) — > 
(z ,x ), we let Df n be the derivative of f n and write 



= f n (Zn, S n ) W f n (z , X ) + Df n (z , X ) 



v2-n 2-0 . 



We now want to solve for z n — z and x n — x . Without the identifiability constraint, 
there are k equations and k + 1 unknowns, but as soon as we impose the condition 
II %n || 2 = ||^o II 2 = 1) the system becomes well-determined. 

To make this precise, we first compute 

Df n (z,x) = (t' (z)x T (z)) + O p , (3.27) 

with pointwise convergence in z. Then, we write 

f n {z n ,X n ) = f n (z ,X ) + Df n (z ,X ) ( ™ J + Op ((z n - Z Q f + \\x n - X \\l) . 

\x n -x l 
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If f n (z n ,x n ) = and f (z ,x ) = 0, then we get 



= -j=f n ,i(zo,x ) + Df n (z ,x ) ( Zn Z ° ] + P ((z n - z ) 2 + \\x n - 2T |||) 
v n \x n -x 0j 



If (z n , x n ) (zo, Xo), then the differences z n — z and x n — x must be of size Op 

and the error term in the Taylor expanson is size Op (-). The final simplification we 

can make is from the length constraint on x n and x . We have 



1 x n x n 



= (x + (x n - x )) (x + (x n - x )) 
= 1 + 2xq (x n — Xo) + \\x n — Xo 111? 

so that Xq (x n — Xo) = Op (^). With a little more effort, we can solve for z n — z and 

•En •?()• 

Lemma 3.18. If(z n ,x n ) is a sequence converging almost surely to (fii,ei), such that 
fn(z n , x n ) = and ^ ^ jij for i ^ j, then: 



MZn ~ fii) = ~ ^r^) U + O^ 1 )' ( 3 - 28 ) 

(ii) 

^/n (x nji - 1) = op (1) , and (3.29) 

(Hi) 

{T n ,i{Ui)) ^ . . 

V« nj = — 77= >. — + o P (l) for ? f j. (3.30) 

Proof. We have done most of the work in the exposition above. In particular, we 
already know that (ii) holds. Using the Taylor expansion, we have 

= ^T„ lfe ) & + (r Mb r„(ft)) r - + op (^) . 
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Since T and T' are diagonal, using (ii) we get 

= ^=(T n , 1 (p i )) u + (T' ^)) tt (z n - fr) + op , 

= -±= (T,,,!^)).. + (ToC/ii)) . . x nJ + o P for % + j. 

Therefore, (i) and (Hi) follow. □ 

At this point, it behooves us to do some simplification. For Hi, fij > 7 -1 ^ 2 , we 
have 

m(jXi) = 



IH + l 1 
Using (3.19), this implies 

-1 7" 1 
Hi = ^ + 1 + 7 H 

+ 7" 1 



(* + l) 



Note that this agrees with the definition of pi in Theorem 3.4. Now, 

m(jk) ~ mjfij) = 1 fijfij 

Pi - Pj (Hi + 7 _1 )(^i + 7" 1 ) fiifij ~ 7 _1 ' 

so that 
Also, 

pijn(pi) - pj m(p j ) _ 1 

We can compute 



(T Q (ni)).. = - hj) ^ for j ^ i, 
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Since (l + 7 1 m(/i i )) 1 = ^ 7 - , we get 

TnM) = (7^^-31^,31 - v^(l - 7~ 1 )^) 



Hi 



DF n (p,i)D 



+ (^77^) {^( D l ~ D) +D(E ntll - G n (jM)) 

+ (£ n , n - G^)) V) 

The first term converges in distribution to a mean-zero multivariate normal with 
variance 2 (1 — 7 _1 ) along the diagonal and variance 1 — 7" 1 otherwise; the elements 
are all uncorrelated except for the obvious symmetry. Also, we have 



Cov (Gij(ni), Gijifia)) = 7 1 • - 

- 7 



v (hi + 7" 1 )(/ i 2 + 7 x ) Wz~l 
1 



C.v (//„(//, ).//„(//,)) - '(1 • <y ' /,J/ ' 2 



+7 _1 )(/^2 + 7" 1 ) ^2-7" r 
Therefore, for j ^ i we have variances 

' Hi + 7 _1X 2 



Var ((T nil (/ii)ei) .) = a* 



+ 2(2^ + l + 7- 1 ) ^±^ + (l), 



(3.31a) 



Vax((T nil (/Z i )e i ) j ) = (2^ + 1 + 7" 1 ) ^2,^-1 + ( 3 - 31b ) 
and nontrivial covariances 

Cov ((iVOiOft)., (T^yfi,-)) = a l3 • • + 0(1) (3.32a) 
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and 



COV ((T^i^ft)., {Tn^jlj) Sj ) t ) = (m + fij + 1 + 7 ') 



. (/Xi + 7- 1 )(^- + 7- 1 ) + o(1) (3 - 32b) 
W - 7" 1 

All other covariances between the elements of T nj i(/2j)ej and T njl (p,j)ej are zero. 



3.7 Singular values and singular vectors 

The results about solutions to the secular equation translate directly to results about 
the singular values and right singular vectors of A=X n . 



3.7.1 Singular values 

Every value z with f n (z, x) — for some x is the square of a singular value of -^X n . 
Therefore, Section 3.6 describes the behavior of the top k singular values. To complete 
the proof of Theorem 3.4 for 7 > 1, we only need to describe what happens to the 
singular values corresponding to the indices i with /ij < 7 -1 / 2 . 

Lemma 3.19. If Hi < 7 -1 / 2 then the ith eigenvalue of S n , converges almost surely 
to b 1 . 

Proof. From Lemma 3.17, we know that for n large enough and e small, there are 
exactly k = max{i : ^ > 7 -1 ^ 2 } eigenvalues of S n in (b-y + e, 00). From the eigenvalue 
interleaving inequalities, we know that the ith eigenvalue of S n is at least as big as 
the ith eigenvalue of S n ^2- 

Denote by fi n>i the ith eigenvalue of S n , with k < i < k. Then almost surely, 

lim X n i < lim p, n i < lim fi n i < 6 7 + e. 

n->oc n-t-oo 

Since \ Hii ^4 6 7 and e is arbitrary, this forces fi n ^ ^4 6 7 . □ 
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Lemma 3.20. If fa < 7 1 I 2 , then \/n(fa — fa) 0, where fa is the ith eigenvalue of 



Proof. We use the same notation as in Lemma 3.19. Recall that in the present 
situation, fa = b 7 . Since = 6 7 + Op(n~ 2 / 3 ), we have that 

Vn{K,i ~ & 7 ) 0. 

This means that 

lim y/n(jx n ^ - 6 7 ) > o P (l). 

n— +00 

The upper bound is a little more delicate. By the Courant-Fischer min-max charac- 
terization, fij 2 is bounded above by j^J 2 , the ith. singular value of 

±=X n + v^((7- 1/2 + ef' 2 - »T)u n ^ 
for any e > 0. From the work in Section 3.6, we know that 

^ = ( 1 + ^ 1/2 + e )( 1 + y7^) +0 >(^ 



\J r ryl/2 £ 



where 



of = an ( 1 - 



l + 2 7 i/% + 7£ 2^ 

+2(% -./ i+£ ) +1+T -.)( 1 - _l_ ) 

= 0(e). 



Therefore, for all < e < 1, we have 



v^(^-6 7 -s 2 ^^)<Op( £ 1/2 ). 
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Letting e — > 0, we get 

lim y/n(fi n:i - by) < o P (l). 

n— >oo 

Together with the lower bound, this implies y/n(£i ni — 6 7 J — > 0. □ 



3.7.2 Right singular vectors 

For Hi > 7 -1 / 2 , we can get a right singular vector of -^X n from the sequence of 
solution pairs {z n ^x n ^ satisfiying f n (z n ,i,S n ,i) = and z n>i ^4 The vector is 
parallel to 

Xn,i= ^ • (3.33) 

We just need to normalize this vector to have unit length. The length of x n is given 
by 

1 1 ^tt. 1 1 2 = {Ik + Sn,12(Sn,22 ~ z rJ-p-k) ^n^l) ¥n- (3.34) 

It is straightforward to show that for z > 6 7 , 

Ik + ^,12(^,22 _ Zlp-k) 2 Sn,21 ~T'o( z ) 

T l m\z) 2 m'(z) 

>U + r^J-k 



(l + 7 1 m(z)) {m(z))' 

uniformly for z in any compact subset of (& 7 , oo). It is also not hard to compute for 
pLi > 7~ 1/2 that 

-T' Q {jli) = --f^D 2 + -T^-ZT 1 *- 
Therefore, if > 7~ 1/2 and (2^,2^) ^4 then 

|2 a.,. + 7 _1 ) 



X 



7Mi 



The behavior of the right singular vectors when ^ < 7 1 / 2 is a little more difficult to 
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get at. We will use a variant of an argument from Paul [68] to show that ||x|| 2 oo, 
which implies that p"'.|| 2 0. We can do this by showing the smallest eigenvalue of 

S n ,12{S n) 22 ~ Z nji Ip- k )~ 2 S n7 21 goes to OO. 

Write 

p—k t 

< 3 n,12l,'-'n,22 — "n^p-h) °n,21 — 7r ^' 

a=l — ^n,jj 

where s„,i,s„, 2 , • • • ,s niP _ fc are the eigenvectors of S„ j22 multiplied by S n , 12 = <Sy 2i , 
and \ n ,i, A„ j2 , . . . , \ n , P -k are the eigenvalues of 5 , „ j22 . For e > 0, define the event 
J n (e) = {z n>i < b 7 + e} . From the interleaving inequality, we have z n>i > X n;i . With 
respect to the ordering on positive-definite matrices, we have 



p—k x 



p—k 



a=l (,^n,a Z n ^) 2 Q _j (A na -^n,?)^ 



p—k 

^E 



T 



(6 7 + £ - z n4 ) 2 



on J n (e) 



It is not hard to show that 



Ei— 1 



(fc 7 +e-2n,i) 2 



0. Therefore, 



p—k 

E 



T 



(6 7 + £ - 



7 1 m / (& 7 + e) d2 m'(b y + e) f 
l + 7 -i m (6 7 + £ ) [m(6 7 + £)] 2 *' 



As n — > 00, we have P(j n (e)) — > 1. So, since m'(& 7 +£) > ^= for some constant C, let- 
ting e^Owe must have that the smallest eigenvalue of S n ,i2 (£71,22 — z n ^I p -k)~ 2 S n ^i 
goes to 00. 



3.7.3 Left singular vectors 

We can get the left singular vectors from the right from multiplication by ~^X n . 
Specifically, if v n ^ is a right singular vector of with singular value z l Jj , then 

u n> i, the corresponding left singular vector, is defined by 



3.8. RESULTS FOR 7 e (0, 1) 



55 



We are only interested in the first k components of u nyi . If 

J / 3£n,i 



Vn..i 



J 



£n,i\\2 \ —(S nt 22 — Zn,ilp~k) 1 <S'n,21 %n,i 

then these are given by ^ 1 ^ R n (z nji ) x n>i , where 

R n ( z ) — D n H — —E n>11 —E n) i2 (S n> 22 — zlp-k) 1 S>i,21- 

v n v ^ 

It is not hard to show that R n (z) Rq(z), uniformly for z > 6 7 , and R n (z) 
Rq(z) + -^R rh i(z) + op (y-^J pointwise for z > 6 7 . Here, 

1 + 7 i m(z) 

and 

i^z) = (l + 7- 1 m(^)~ 1 ( v ^(A,-r>) + £„,ii) 
-(l+7- 1 m(z))" 2 J F n (^) J D-G n (^). 

Let y„ 5 j = || - 1 || 2 -R ra (z ra; i) x Wi j. A straightforward calculation shows the following: 
Lemma 3.21. If fii > 7 -1 / 2 and {z n ^ x n ,i) ^4 (//j,ej), £/ien 

-(f A*i < 7 -1 / 2 and ^4' b^, then 

a.s. _ 



3.8 Results for 7 G (0, 1) 



Remarkably the formulas for the limiting quantities still hold when 7 < 1. To see 
this, we can get the behavior for 7 < 1 by taking the transpose of X n and applying 
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the theorem for 7 > 1. We switch the roles of n and p, replace /!« by /4 = 7/ij, and 
replace 7 by 7' = 7" 1 . Then, for instance, the critical cutoff becomes 



1 

Hi > -=, i.e. 



11H > 7 1/2 , 



which is the same formula for 7 > 1. The almost-sure limit of the square of the ith 
eigenvalue of ^X n X^ becomes 

A = torn + 1) (1 + ^ 

= 7ta + l)(l + ^ 
= lik- 

The formulas for the other quantities also still remain true. 



3.9 Related work, extensions, and future work 

With the proofs completed, we now discuss some extensions and related work. 



3.9.1 Related work 

When Johnstone [45] worked out the distribution of the largest eigenvalue in the 
null (no signal) case, he proposed studying "spiked" alternative models. Spiked data 
consists of vector-valued observations with population covariance of the form 

£ = diag(/ii,/z 2 , . . . 1, . . . , 1). 



Work on the spiked model started with Baik et al. [11], Baik & Silverstein [12], and 
Paul [68]. Baik et al. showed that for complex Gaussian data, a phase transition 
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phenomenon exists, depending on the relative magnitude of the spike. Baik & Sil- 
verstein gave the almost-sure limits of the eigenvalues from the sample covariance 
matrix without assuming Gaussianity. Paul worked with real Gaussian data and 
gave the limiting distributions when 7 > 1. Paul also gives some results about the 
eigenvectors. 

After this initial work, Bai & Yao [8] [9] derived the almost-sure limits and prove a 
central limit theorem for eigenvalues for a general class of data that includes colored 
noise and non-Gaussianity. Chen et al [19] consider another type of spiked model 
with correlation. Nadler [62] derived the behavior of the first eigenvector in a spiked 
model with one spike. 

The above authors all consider data of the form X = ZE 1 / 2 . Onatski [64], like 
us, examines data of the form X = UDV T + E. With slightly different assumptions 
than ours, he is able to give the probability limits of the top eigenvalues, along with 
the marginal distributions of the scaled eigenvalues and singular vectors. However, 
Onatski does not work in a "transpose-agnostic" framework like we do, so his methods 
do not allow getting at the joint distribution of the left and right singular vectors. 

3.9.2 Extensions and future work 

We have stopped short of computing the second-order behavior of the singular vectors, 
but no additional theory is required to get at these quantities. Anyone patient enough 
can use our results to compute the joint distribution of ||5 n ,i||2, x n ,i, an d y n ,i- This in 
turn will give the joint behavior of the singular vectors. 

Most of the proof remains unchanged for complex Gaussian noise, provided trans- 
pose ( T ) is replaced by conjugate-transpose ( H ). The variance formulas need a small 
modification, since the fourth-moment of a real Gaussian is 3 and that of a complex 
Gaussian is 2. 

For colored or non-Gaussian noise, we no longer have orthogonal invariance, so 
the change of basis in Section 3.3 is a little trickier. It is likely that comparable 
results can still be found, perhaps using results on the eigenvectors of general sample 
covariance matrices from Bai et al. [3]. 
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Chapter 4 



An intrinsic notion of rank for 
factor models 

As Moore's Law progresses, data sets measuring on the order of hundreds or thousands 
of variables are becoming increasingly more common. Making sense of data of this 
size is simply not tractable without imposing a simplifying model. One popular 
simplification is to posit existence of a small number of common factors that drive 
the dynamics of the data, which are usually estimated by principal component analysis 
(PC A) or some variation thereof. The n x p data matrix X is approximated as a 
low-rank product X UV T , where U and V are n x k and px k, respectively, with 
k much smaller than n and p. 

The number of algorithms for approximating matrices by low-rank products has 
exploded in recent years. These algorithms include archetypal analysis [21], the semi- 
discrete decomposition (SDD) [49], the non-negative matrix factorization (NMF) [52], 
the plaid model [51], the CUR decomposition [25], and regularized versions thereof. 
They also include some clustering methods, in particular fc-means and fuzzy fc-means 
[14]. 

A prevailing question is: How many common factors underlie a data set? Alter- 
nately, how should one choose k? In general, the answer to this question is application- 
specific. If we are trying to use X to predict a response, y, then the optimal k is 
the one that gives the best prediction error for y. The situation is not always this 
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simple, though. For exploratory analysis, there is no external response, and we want 
to choose a k that is "intrinsic" to X . For other applications, we don't have a single 
response, y, we have many responses y±, y^ . . . , y m . We may not even know all of the 
y,i when we are processing X. We want a k that has good average-case or worst-case 
prediction properties for a large class of responses. 

In this chapter, we develop a precise notion of intrinsic rank for "latent factor" ma- 
trix data. This choice of k is the one that minimizes average- or worst-case prediction 
error over all bilinear statistics. We specifically work with the singular value decom- 
position (SVD), but our definition can be extended to other matrix factorizations as 
well. 

Section 4.1 introduces the latent factor model, the data model from which we 
base our constructions. Section 4.2 defines intrinsic rank as the minimizer of a loss 
function. We give a theoretical analysis of the behavior of some natural loss functions 
in Section 4.3, followed by simulations in Section 4.4. In Section 4.5, we examine the 
connection between intrinsic rank and the scree plot. Finally, Section 4.6 discusses 
some extensions and Section 4.7 gives a summary of the chapter. 

4.1 The latent factor model 

We start by describing a model for data generated by a small number of latent factors 
and additive noise. Suppose that we have n multivariate observations X\,X2, ■ ■ ■ ,x n G 
W. In a microarray setting, we will have about n = 50 arrays measuring the activa- 
tions of around p = 5000 genes (or 50000, or even millions of genes in the case of exon 
arrays). Alternatively, for financial applications will measure the market value of 
on the order of p = 1000 assets on day i, and we may be looking at data from the 
last three years, so n m 1000. In these situations and others like them, we can often 
convince ourselves that there aren't really 5000 or 1000 different things going on in 
the data. Probably, there are a small number, k of unobserved factors driving the 
dynamics of the data. Typically, in fact, we think k is on the order of around 5 or 10. 

To be specific about this intuition, for genomics applications, we don't really 
think that all p = 5000 measured genes are behaving independently. On the contrary, 
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we think that there are a small number of biological processes that determine how 
much of each protein gets produced. In finance, while it is true that the stock prices 
of individual companies have a certain degree of independence, often macroscopic 
effects like industry- and market-wide trends can explain a substantial portion of the 
value. 



4.1.1 The spiked model 

One way to model latent effects is to assume that the Xi are iid and that their 
covariance is "spiked" . We think that Xj is a weighted combination of k latent factors 
corrupted by additive white noise. In this case, x% can be decomposed as 



Xi 



k 



3 S i,3 



(4.1) 



where A = (^ai a 2 ■ ■ ■ is a p x k matrix of latent factors common to all 

observations and Sj 6 l fc is a vector of loadings for the ith observation. We assume 
that the noise vector is distributed as A/"(0, <J 2 I p ). If the loadings have mean zero 
and covariance S s e R kxk , and if they are also independent of the noise, then Xj has 
covariance 

£ = E [xi xj] = ASg A T + a 2 I p . (4.2) 
The decomposition in (4.2) can be reparametrized as 



S = QAQ T + a 2 I p , 



(4.3) 



where Q T Q = Ik and A = diag (Ai, A 2 , . . . , A*,) , with Ai > A 2 > • • • > A^ > 0. 
Equation (4.3) makes it apparent that S is "spiked" , in the sense that most of its 
eigenvalues are equal, but k eigenvalues stand out above the bulk. The first k eigen- 
values are \i + a 2 , \ 2 + u 2 , . . . , \ k + a 2 , and the remaining p — k eigenvalues are equal 
to a 2 . 
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4.1.2 More general matrix models 

We can introduce a model more general than the spiked one by specifying a distri- 
bution for the n x p data matrix X = (^xi x 2 ■ ■ ■ x^j that includes dependence 
between the rows. In the spiked model, the distribution of X can be described as 

X = ZA^ 2 Q T + E, (4.4) 

where E = ^ e 2 ■ ■ ■ e^j and Z is an n x k matrix of independent A/"(0, 1) 
elements. More generally, we can consider data of the form 

X = y/nUDV T + E, (4.5) 

where U T U = V T V = Ik, and D = diag(di, d 2 , ■ ■ ■ , dt) with di > d 2 > ■ ■ ■ > d k > 0. 
We can get (4.5) from (4.4) by letting ZA 1/2 = ^/nUDC T be the SVD of ZA 1/2 and 
defining V = QC. Unlike the spiked model, (4.5) can model dependence between 
variables as well as dependence between observations. 



4.2 An intrinsic notion of rank 

With Section 4.1's latent factor model in mind, we turn our attention to defining the 
intrinsic rank of a data set. This definition will be motivated both by the generative 
model for X and by predictive power considerations. When X = y/nUDV T + E, 
we think of y/n UDV T as "signal" and E as "noise" . We make a distinction between 
the generative rank and the effective rank. 

Definition 4.1. // the n x p matrix X is distributed as X = y/nUDV T + E, 

where U T U = V T V = Ik , D is a k x k diagonal matrix with positive diagonal 
entries, and E is a noise matrix independent of the signal term whose elements are 
iidj\f(0, a 2 ), then we denote by k the generative rank of X. 

Intuitively the generative rank is the rank of the signal part of X. 

The effective rank is defined in terms of how well the first terms of the SVD 
of X approximates the signal ^/nUDV T . We let X = t/uUDV be the full 
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SVD of X, where U — \y,i w 2 • • • m„a p ) , V — [vi V2 ••• VnApj > an d D = 

diag (di, d 2 , ■ ■ ■ , d nAp ^ . If we let D(k) = diag (di, d 2 , ■ ■ ■ , dk, 0, 0, . . . , 0^ , then the 

SVD truncated to k terms is = ^UD(k)V . We are now in a position to 
define effective rank 



Definition 4.2. Given a loss function L : ]R nx P x IR raxp — ► M, t/ie effective rank of X 
with respect to L is equal to 

k* L = argmin L {^/nUDV T , X(k)^j . (4.6) 

The effective rank depends on the choice of loss function. In some settings it 
is preferable to choose an application-specific loss function. Often, we appeal to 
simplicity and convenience and choose squared Frobenius loss. Specifically, 

L F (A,A') = \\A-A%. (4.7) 

One way to motivate this loss function is that it measures average squared error over 
all bilinear statistics of the form a T A (3, 

— \\A-A%= [ (a T Ap-a T A'p) 2 dadp 
np J 

1 1 ct|| 2 = 1 , 

(see Section A. 3. 2 for details). In the context of the latent factor model, the effective 
rank with respect to L F is the rank that gives the best average-case predictions of 
bilinear statistics of the signal part (with respect to squared-error loss). 

A common alternative to Frobenius loss is spectral loss, given by 

L 2 (A, A') = \\A-A'\\l (4.8) 
This can be interpreted as worst-case squared error over the class of all bilinear 
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statistics, 

\\A-A'\\l= sup (a T A§-a T A'§) 2 . 



IM|2 = 1, 

ll/3||2=l 



In the sequel, we denote the optimal ranks with respect to Frobenius and spectral 
loss as kp and k 2 , respectively. 



4.3 Loss behavior 

In this section we investigate the behavior of the loss functions introduced in Sec- 
tion 4.2. First, we need to be more precise about our working assumptions on the 
data matrices. The theory is easier if we work in an asymptotic setting, introducing 
a sequence of data matrices Xi,X 2 , ■ ■ ■ ,X n , where X n G M. nxp , p = pin), n — > oo, 
and | — > 7 G (0, oo). We will need three assumptions. 

Assumption 4.3. The matrix X n G lR nxp can be decomposed as 

X n = ^U n D n Vl + E n . (4.9) 

Here, U n G R nxk ° , D n G M fc ° xfco , and V n G M pxfc °. The left and right factors U n 
and V n satisfy U^TJ n = V^V n = T7ie aspect ratio satisfies ^ = 7 + / or 
a constant 7 G (0, 00). TTie number of factors k is fixed. 

Assumption 4.4. The matrix of normalized factor strengths is diagonal with D n = 
diag (d n> i, d n> 2, ■ ■ ■ , d n ^ ) ■ For 1 < i < k , the diagonal elements satisfy d 2 n i ^4' /i« /or 
deterministic ^ satisfiying a x > a 2 > • • • > u ko > 0. 

Assumption 4.5. The noise matrix E n has iid elements independent of U n , D n , 
and V n , with £^,11 ~ jV(0, a 2 ). 

These assumptions allow us to apply the results of Chapter 3 to get the first-order be- 
havior of the SVD of X n . As before, we let u n> i, u n>2 , . . . , u n ^ and v ni i,v n ,2, ■ ■ ■ , V n ,k 

~ - - T 

denote the columns of U n and V n , respectively. We set X n = y^nU n D n V n to 
be the SVD of X n , where the columns of U n and V n are u n ,i,Un,2, ■ ■ ■ ,Un,nA P and 
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Vn,uVn,2, ■ ■ ■ ,v n , nAp , respectively. With D n = diag . . • ,/t^ Ap ) , we set 

D n (k) = diag /}^ 2 2 , . . . , fi 1 ^ 0, 0, . . . , 0) so that X n (k) = U n D n (k)V~l is 

the SVD of X n truncated to k terms. 

We can decompose the columns of V n into sums of two terms, with the first term 
in the subspace spanned by V n , and the second term orthogonal to it. By setting 
& n = V^Vn and taking the QR decomposition of V n — V n @ n , the matrix of right 
factors can be expanded as 

v n = v n e n + v n e n , (4.10) 

with V n E W x(p - k ^ satisfying V T a V n = I p _ fco and VZV n = 0. We choose the signs 
so that @ n has non-negative diagonal entries. Note that 

i k = v T n v n = &J l & n + e T n e n . (4.11) 

In particular, since 0„ is upper-triangular, if © n converges to a diagonal matrix ©, 

1/9 

then 0„ must also, converge to a diagonal matrix, (J fe — © 2 ) . This makes the 
decomposition in equation (4.10) very convenient to work with. 

The same trick applies to U n . We expand 

U n = U n $ n + U n <f> n , (4.12) 
with tj n and defined analogously to V n and © n . Again, we have that 

I k = &Z*n + K*n (4.13) 

Likewise, if <fr n converges to a diagonal matrix, then $ n must do the same. 

We can new get a simplified formula for X n (k). With the decompositions in 
equations (4.10) and (4.12), we get 

'$ n D n (k)&Z $ n D n (k)eZ' 
t (A0e£ *„£„(*)©» 



x ni k) = vn ( Un u n ) v 

V ' \& n DJk)Q? $ n DJk)&Z \vZl 
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We can get the asymptotic limits of the quantities above. For 1 < i < fc , we set 

{(u,i + a 2 ) (l + when m > 

^ \o ^ * ^ (4.15a) 

a 2 1 1 + -j=j otherwise, 

{\(l-A)(l + —Y 1 when^>4, 
V V ifi) \ * v^r' (4.15b) 

otherwise, 

. -4) + when ^> 4, 

^ =< ;YV 7M?A ^ v^' (4.15c) 

otherwise, 

while for i > fc , we set /2j = a 2 ^1 + j and 6i = tpi — 0. For % > 1, we define 



0* = a/1-0 2 , ( 4 - 15d ) 



With 



and 



<Pi = dl-tf. (4.15e) 



= diag (ri /2 , /ii /2 , . . • , /if, 0, 0, . . . , 0) G R" x " (4.16a) 



= diag(0 1 ,0 2 ,...,0 p ), (4.16b) 

$ = diag (</?i, </? 2 ,. • • , (4.16c) 

e = diag(0 1 ,0 2 ,...,0 p ), (4.16d) 

* = diag (^i,^ 2 ,. . . ,<p n ) , (4.16e) 
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Theorems 3.4 and 3.5 give us that for fixed k as n — > oo, 



* n D n (*0©n a -4' &D(k)& T , 
& n D n (k)&l a -4' $£>(A;)e T , 

This result makes it easy to analyze the loss behavior. Letting ^ = for i > k , 
putting fti(k) = fii l{i < k} and 



= I - -V2/-, - -1/2,, x 5 I ( 417 ) 



we have that for || • || being spectral or Frobenius norm, for fixed k as n — > oo, 
-^||v^C/ n D n yT - X n (fc)|| °-4' || diag (Fx(A;), F 2 (fc), . . . , F kvko (k)) ||. 

Thus, 

i ||vfttf„A,VZ - * B (*)|£ - £ WFMWl (4.18) 



and 

^||v^C/ nJ D„y^-X n (fc)||^-4- max (4.19) 

Similar results can be gotten for other orthogonally-invariant norms. A straightfor- 
ward calculation shows 

Fj(k) Fl (k) 

' ' in - 2(p i pL 1 i /2 9 i fl 1 i /2 (k) + efpL^k) -<Pi0ifiy 2 pi /2 (k) + 9 i 9 i fi i (kf 
- Vi 9^] /2 Jh' 2 {k) + 9i9~ifii(k) 9^(k) 
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so that 

tr (Fj(k) Fi(k)) 2 Vl 6 l fh l2 Jh l \k) + pi(k), 

det {F]{k)F l {k))=f i e 2 l ^Hk)- 

When ^ > we can use the identities ipiOip 1 / 2 /i' 1 ^ 2 — 1 — and (pfO 2 = to 
get 

tT(Fj(k)Fi(k)) = \ ™ V 1 ^ ^ " (4.20a) 
I /Xj otherwise, 

det(Fj(k)F l (k)) = \^J ^ >Kmi ' 



otherwise. 



[tr (Fj(k) F^k))} 2 - 4 det {Fj(k) F t {k)) }• 



(4.20b) 



When /ij < we have 

ti(Fj(k)Fi(k)) = l \ ^) (4.21a) 

l/ij otherwise, 

det(Fj(k)F t (k)) = r V ^ (4.21b) 

I otherwise. 

We can use these expressions to compute 

\\F t (k)\\ 2 F = tr(Fj(k)F t (k)), (4.22) 
\\F t (k)\\l = l{tr(Fj(k)F t (k)) 

+ 



(4.23) 



The expression for the limit of ^\/nU n D n V^ l — X n {k)^\ is fairly complicated. In 
the Frobenius case, we have 
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Proposition 4.6. For fixed k as n — > oo ; we /lave 

-||v^C/ nJ D„^-X n (A;)|| 2 F 

n 

k 



»4^ l+ ^ /tl + Wl + - -(k-k ) + , (4.24) 

i=l i=fe+l ^ V '/ 

f^(3 ( r 2 + (7 + l)/i i ) ifni>^=, 

ai =h'* K 2 , U J M J " ^ (4.25) 

1 1 + ^7 i 1 + 7y j o^cruwe. 

Figure 4.1 shows aj as a function of fa. It is beneficial to include the ith term 
when ttj < 1, or equivalently fa > fa%, with 



This gives us the next Corollary. 
Corollary 4.7. As n — > oo ; 

A;^ ^4' max {« : /ij > , 
provided no fa is exactly equal to jj* F . 

2 

The theory in Chapter 3 tells us that when fa > the ith signal term is "detectable" 
in the sense that the ith sample singular value and singular vectors are correlated with 
the population quantities. Proposition 4.6 tells us that when fa G /ipj , the ith 
signal term is detectable, but it is not helpful (in terms of Frobenius norm) to include 
in the estimate of y/nU n D n Vj t . Only when fa surpasses the inclusion threshold fa% 
is it beneficial to include the ith term. 

Figure 4.2 shows the detection and inclusion thresholds as functions of 7. 
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5 10 15 20 

Signal Strength 



Figure 4.1: Frobenius LOSS penalty. Relative penalty for including the ith factor 
in the SVD approximation of y/nU n D n Vn, with respect to squared Frobenius loss. 
When the zth factor has signal strengh the cost for excluding the zth term of the 
SVD is fMi, and the cost for including it is ati ■ m. Here, we plot a« as a function of \ii 
for various aspect ratios 7 = -. The units are chosen so that a 2 = 1. 
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Figure 4.2: Signal strength thresholds. Detection threshold 7- 1 / 2 and inclu- 
sion threshold /i F = h y ( 1+ 2^ + ^ plotted against the aspect ratio 7 = ^. 

When the normalized signal strength is above the detection threshold, the zth 
sample SVD factors are correlated with the population factors. With respect to 
Frobenius loss, when the normalized signal strength is above the inclusion threshold, 
it is beneficial to include the ith term in the SVD approximation of y/nU n D n V^. 
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4.4 Simulations 

We confirm the theory of the previous section with a Monte Carlo simulation. We 
generate a matrix X as follows: 

1. The concentration 7 is one of {0.25, 1.0,4.0}. 

2. The size of the matrix s is one of {144,400, 1600,4900}. We set the number of 
rows and columns in the matrix as n = v /s~7 and p = a/ s/7, respectively. This 
ensures that 7 = ^ and s = np. 

3. The noise level is set at o 2 = 1. The noise matrix E is an n x p matrix with 
iidA/"(0, 1) elements. 

4. The generative rank, k , is fixed at 5. The normalized factor strengths are set at 
(/ii, /i2, H3, /i4, /X5) = (4/ip, 2/ip, /ip, |// F , |/ip), and the factor strength matrix 
is D = diag (/i^ 2 , /4 /2 , • • • , /4 /2 ) • 

5. The left and right factor matrices U and V are of sizes n x 5 and p x 5, 
respectively. We choose these matrices uniformly at random over the Stiefel 
manifold according to Algorithm A.l in Appendix A. 

6. We set X = ^/nUDV T + E and let X(k) be the SVD of X truncated to k 
terms. 

After generating X, we compute the squared Frobenius loss ^|| y/nUDV T — X(k) \\ 2 F 
and the squared spectral loss ^\\y/nUDV T — X(k)\\l as functions of the rank, k. In 
Figures 4.3 and 4.4, we plot the results over 500 replicates. For the Frobenius case, 
we should expect the loss to decrease until k = 2, stay flat at k — 3, and then increase 
thereafter. This is confirmed by the simulations. The spectral norm behaves similarly 
to the Frobenius norm. 
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Figure 4.3: Simulated Frobenius LOSS. Squared Frobenius loss ^\\ v ^nUDV' T - 
X(k)\\ F as a function of the rank, k, for X generated according the procedure de- 
scribed in Section 4.4 and X(k) equal to the SVD of X truncated to k terms. 
The concentration 7 = ^ is one of {0.25,1.0,4.0} and the size s = np is one of 
{144,400,1600,4900}. The solid lines show the means over 500 replicates with the 
error bars showing one standard deviation. The dashed lines show the predictions 
from Proposition 4.6. We can see that as the size increases, the simulation agrees 
more and more with the theory. 
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Figure 4.4: Simulated spectral LOSS. Squared spectral loss ^\\y/nUDV T - 
_X"(A;)||| as a function of the rank, k, with X and X(k) as in Figure 4.3. Sample size 
s = np is shown in the columns and concentration 7 = ^ is shown in the rows. Solid 
lines show the means over 500 replicates with the error bars showing one standard 
deviation; the dashed lines show the predictions from Section 4.3, specifically from 
equations (4.19), (4.20a-4.21b), and (4.23). The simulations agrees quite well with 
the theory, especially for large sample sizes. 
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4.5 Relation to the scree plot 

Cattell's scree plot [18] is a popular device for choosing the truncation rank in prin- 
cipal component analysis. One plots the square of the singular value, df., against the 
component number, k. Typically such a plot exhibits an "elbow", where the slope 
changes noticeably. This is the point at which Cattell recommends truncating the 
SVD of the matrix. 

In some circumstances, the elbow is close to the Frobenius and spectral loss mini- 
mizers, k F and k* 2 . In Figure 4.5, we generate a matrix X = ^/nU DV T + E e IR" xp , 
with n = p = 100, such that elements of E are iid jV(0, 1). In the first column, we set 
D 2 = diag(5.0, 4.75, 4.5, . . . , 0.5, 0.25). The figure shows the scree plot along with 
W^/nU DV T — X(A;)|| 2 for Frobenius and spectral norms, where X{k) is the SVD 
of X truncated to k terms. There is substantial ambiguity in determining the loca- 
tion of the most pronounced elbow. Despite this subtlety, there is indeed an elbow 
relatively close to k F and k 2) the minimizers of the two loss functions. 

We can easily manipulate the simulation to get a scree plot with a more pro- 
nounced elbow in the wrong place. In the second column of Figure 4.5, we aug- 
ment the factor strength matrix with three additional large values, so that D 2 = 
diag(20.0, 15.0, 10.0, 5.0, 4.75, 4.5, . . . , 0.5, 0.25). With this modification, there is a 
clear elbow at k — 4. However, compared to the optimal values, truncating the SVD 
at k — 4 gives about a 25% worse error with respect to squared Frobenius loss and 
about 50% worse with respect to squared spectral loss. 

In general, we cannot make any assurances about how close the elbow is to kp 
or k 2 . Through a simulation study, Jackson [42] provides evidence that if the latent 
factors are strong enough to be easily distinguished from noise, then the elbow is a 
reasonable estimate of the loss minimizers (he does not actually compute the loss 
behavior, but this seems likely). However, when there are both well-separated factors 
and factors near the critical strength level /x F , the second example here illustrates 
that the elbow might be a poor estimate. 
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Figure 4.5: Scree plots and LOSS functions. We generate a matrix as X = 
^JnUDV 1 ' + E and set X(k) equal to the SVD of X truncated to k terms. The left 
and right columns show different choices of D, described in the text. The top row 
contains scree plots, where the square of the kth singular value of X, d\, is plotted 
against component number, k, with units are chosen so that d\ = 1. The next 
two rows show \\y/nUDV T - X(k)\\l and \\^UDV T - X{k)\\ 2 2 , as functions of 
the rank, k with units chosen so that the minimum value is 1.0. Dashed lines show 
the elbow of the scree plot and the minimizers of the two loss functions. The elbow of 
the scree plot (which is fit by eye) gives a reasonable estimate of the loss minimizers 
in the first column, but not in the second. 
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4.6 Extensions 



We have focused specifically on truncating the singular value decomposition because 
that is the case for which the theory has been most developed. We have also focused 
specifically on Frobenius and spectral losses. One could easily extend our work to look 
at the nuclear norm loss (sum of singular values) [32] or Stein's loss [43]. Alternatively, 
it is not hard to adapt our analysis to study forms like \\UD(k) 2 U — UD 2 U \\. 
For matrix decompositions beyond the SVD, extension of our work is more difficult, 
mainly because very little scholarship has been devoted to their theoretical properties. 



We have not examined shrinking the singular values at all, but in some situations 
this may be beneficial. For example, the Frobenius norm of the Fi matrix from 
Section 4.3, which is involved in the Frobenius loss \\^/nU DV^ — X(k)\\ 2 F , converges 
to 

Hi - 2ip i e i /j, 1 i / ' 2 fi 1 i /2 + pi 

whenever k > i. Recall that pi is the almost-sure limit of the d 2 , the square of the 
ith singular value of -^X. If we shrink the ith singular value, replacing di with f(di) 
for continuous /(•), then the Frobenius penalty for including the ith term converges 
to 

^-2 V Afi} /2 fm] /2 ) + f 2 mi /2 ). 

This quantity is minimized when fip^J 2 ) = ^ 2 (pi9i = ]i i ^ — j . After some 
algebra, the optimal / takes the form 

, (d 2 - 2(1+ l -)a 2 +(1-I) 2 ^V /2 when^Xrfl + ^V 
f(di) = ^ 7 di) \ (4.27) 

otherwise. 



With this shrinkage, it is always beneficial (in terms of Frobenius norm) to include 
the ith term. 
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4.7 Summary and future work 

We have described a plausible model for data generated by a small number of latent 
factors corrupted by additive white noise. With this model in mind, we have moti- 
vated two loss functions, the squared Frobenius norm and squared spectral norm, as 
measuring average- or worst-case quadratic error over the class of all bilinear statis- 
tics. These loss functions in turn determine the intrinsic ranks kp and k%- We have 
shown how the losses and the ranks behave for the truncated SVD, both theoretically 
and through simulation. Finally, we have explored the relation between intrinsic rank 
and the scree plot, and then discussed some extensions. 

We did not describe a way to estimate the error from truncating an SVD, nor 
did we propose a practical method for choosing k. For now, our hope is that this 
work is useful in developing intuition for how the SVD behaves and that it provides a 
suitable starting point for designing and evaluating such procedures. In later chapters, 
we explicitly discuss estimation and rank selection. 



Chapter 5 



Cross-validation for unsupervised 
learning 

The problem unsupervised learning (UL) tries to address is this: given some data, 
describe its distribution. Many estimation problems can be cast as unsupervised 
learning, including mean and density estimation. However, more commonly unsuper- 
vised learning refers to either clustering or manifold learning. A canonical example 
is principal component analysis (PCA). In PCA, we are given some high-dimensional 
data, and we look for a lower- dimensional subspace that explains most of the variation 
in the data. The lower-dimensional subspace describes the distribution of the data. In 
clustering, the estimated cluster centers give us information about the distribution of 
the data. The output of every UL method is a summary statistic designed to convey 
information about the process which generated the data. 

Many UL problems involve model selection. For example, in principal component 
analysis we need to choose how many components to keep. For clustering, we need 
to choose the number of clusters in the data. Many manifold learning techniques 
require choosing a bandwidth or a kernel. Often in these contexts, model-selection is 
done in an ad-hoc manner. Rules of thumb and manual inspection guide most choices 
for how many components to keep, how many clusters are present, and what is an 
appropriate kernel. Such informal selection rules can be problematic when different 
researchers come to different conclusions about what the right model is. Moreover, 
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even when there is an obvious "natural" model to human eyes, it may be hard to pick 
out in computer-automated analysis. For objectivity and efficiency, it is desirable to 
have a well-specified automatic model selection procedure. 

For concreteness, in this chapter we focus on principal components, though many 
of the ideas generalize to other methods. We suppose that the data, X, is an n x p 
matrix generated according to the signal-plus-noise model X = y/nUDV^ + E. We 
consider the first term to be the "signal" matrix, and the second term to be "noise". 
Often the signal term is a low-rank product. Our goal is to estimate this term as best 
as possible by truncating the singular value decomposition (SVD) of X. Here "best" 
means with respect to the metrics introduced in Chapter 4. We are interested in the 
model selection problem where each model is defined by the number of terms we keep 
from the SVD of X. 

We would like our model selection procedure to be non-parametric, if possible. 
To work in a variety of contexts, the selection procedure cannot assume Gaussianity 
or independence across samples. We would like the procedure to be driven by the 
empirical distribution of the data. Cross-validation (CV) is a popular approach to 
model selection that generally meets these criteria. Therefore, we seek to adapt CV 
to our purposes. 

CV prescribes dividing a data set into a "test" set and a "train" set, fitting a model 
to the training set, and then evaluating the model's performance on the test set. We 
repeat the fit /evaluate procedure multiple times over different test /train partitions, 
and then average over all replicates. Traditionally, the partitions are chosen so that 
each datum occurs in exactly one test set. As for terminology, for a particular replicate 
the test set is commonly referred to as the held-out or left-out set, and likewise the 
train set is often called the held-in or left-in set. 

Most often, cross-validation is applied in supervised contexts. In supervised learn- 
ing (SL) the data consists of a sequence of (x, y) predictor-response pairs. Broadly 
construed, the goal of supervised learning is to describe the conditional distribution 
of y given x. This is usually for prediction or classification. In the supervised context, 
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for a particular CV replicate there are four parts of data: 



X 



train 



Y 



train 



X 



test 



Y 



test 



Implicit in the description of cross-validation is that the replicates use X test to predict 
Vtest- So, the held-in data looks like 



We extrapolate from X tes t to predict latest- 

It is not immediately obvious how to apply cross-validation to unsupervised learn- 
ing. In unsupervised learning there is no Y; we instead have the two-way partition 



There is nothing to predict! Renaming X to Y does not make the problem any 
better, for then the division becomes: 



Now, there is nothing to extrapolate from to predict latest' For cross-validation to 
work in unsupervised learning, we need to consider more general hold-outs. 

We look at two different types of hold-outs in this chapter. The first, due to Wold, 
is "speckled" : we leave out random elements of the matrix X and use a missing data 
algorithm like expectation-maximization (EM) for prediction. The second type of 
hold-out is "blocked" . This type, due to Gabriel, randomly partitions the columns of 






with hold-in 
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X into "predictor" and "response" sets and then performs the SL version of cross- 
validation. 

5.1 Assumptions, and notation 

We will generally assume we have data X E lR nxp generated according to the latent 
factor model 

X = y/nUDV T + E. (5.1) 

Here, U E R nxk °, V E R nxk », and D = diag(di, d 2 , . . . , d ko ), with U T U = I ko 
V T V = I ko , and d t > d 2 > ■ ■ ■ > d ko > 0. We call ^UDV T the signal part and E 
the noise part. In the spirit of data-driven analysis, we avoid putting distributional 
assumptions on U, D, V, and E. This makes the terms unidentifiable. While this 
indeterminacy can (and should!) bother some readers, for now we will plod on. 
We denote the SVD of X by 

X = ^UDV T , (5.2) 

with U E R n * nA P, V E RP* nA P, and D = diag(di, d 2 , . . . , d nAp ). Here, U^U = 
V V = I nAp and the singular values are ordered di > d 2 > ■ ■ ■ > d nAp > 0. We set 
D(k) = diag(di, d 2 , . . . , d k , 0, . . . , 0) E R n ^ n ^p so that 

X(k) = V^UD{k)V T (5.3) 

is the SVD of X truncated to k terms. Similarly, we define U(k) E R nxk and 
V(k) E R pxk to be the first k rows of U and V, respectively. 
We focus on estimating the squared Frobenius model error 

ME(fc) = \\V^UDV T -X(k)\\l (5.4) 



or its minimizer, 



^me = argminME(A;). 

k 



(5.5) 



5.2. CROSS VALIDATION STRATEGIES 
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Here, || • ||| is the sum of squares of the elements. 

Another kind of error relevant to cross-validation is prediction error. We let E' 
be a matrix independent of E but having the same distribution conditionally on U, 
D, and V T . We set X' = y/nUDV T + E' and define the prediction error 

PE(fc) = E\\X' - X{k)\\ 2 F . (5.6) 

Likewise, we set 

k PE = argminPE(A;). (5.7) 

k 

If E is independent of U, D, and V, then 

PE(fc) = E\\y/EUDV T - X(k) + E'\\l 
= mVnUDV T - X(k)\\ 



+ 2E 



tr ((VnUDV T - X(k)) r E'^ + E\\E'\\ 2 F 



= E[ME(k)]+E\\E\\l. (5.8) 

The prediction error is thus equal to the sum of the expected model error and an 
irreducible error term. 

Finally, we should note that our definitions of prediction error and model error 
are motivated by the definitions given by Breiman [15] for cross-validating linear 
regression. 



5.2 Cross validation strategies 

In this section we describe the various hold-out strategies for getting a cross-validation 
estimate of PE(fc). It is possible to get an estimate of ME(fc) from the estimate of 
PE(fc) by subtracting an estimate of the irreducible error. For now, though, we choose 
to focus just on estimating PE(fc). 

For performing K-iold cross-validation on the matrix X, we partition its elements 
into K hold-out sets. For each of K replicates and for each value of the rank, k, we 
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leave out one of the hold-out sets, fit a /c-term SVD to the left-in set, and evaluate its 
performance on the left-out set. We thus need to describe the hold-out set, how to fit 
an SVD to the left-in data, and how to make a prediction of the left-out data. After a 
brief discussion of why the usual (naive) way of doing hold-outs won't work, we survey 
both speckled hold-outs (Wold-style), as well as blocked hold-outs (Gabriel-style). 

5.2.1 Naive hold-outs 

The ordinary hold-out strategy will not work for estimating prediction error. Suppose 
we leave out a subset of the rows of X. After permutation, the rows of X are 

partitioned as | 1 , where X 1 e R niXp and X 2 E M n2Xp , and m + n 2 = n. The 

only plausible prediction of X 2 based on truncating the SVD of Xi is the following: 

1. Let X 1 = ^/nUAV^ be the SVD of X u with D l = diag(^ 1) , d { 2 \ d { £ Ap ). 

2. Let D^k) = diag(4 1 , ) 4V- • • ,4^0, . . . , 0) so that X^k) = y/ntf^D^vl is 
the SVD of Xi truncated to k terms. Similarly, denote by V\(k) e M. pxh the 
first k columns of V\. 

3. Let ( + ) denote pseudo-inverse and predict the held out rows as X 2 (k) = 

x 2 x 1 (kr( Xl (k)x 1 (kr) + x 1 (k) = x 2 v 1 (k)v 1 (k) T . 

The problem with this procedure is that \\X 2 — X 2 (k)\\p decreases with k regardless 
of the true model for X. So, it cannot possibly give us a good estimate of the error 
from truncating the full SVD of X. 

A similar situation arrises if we leave out only a subset of the columns of X. 
To get a reasonable cross-validation estimate, it is therefore necessary to consider 
more-general hold-out sets. 

5.2.2 Wold hold-outs 

A Wold-style speckled leave-out is perhaps the most obvious attempt at a more general 
hold-out. We leave out a subset of the elements of the X, then use the left-in elements 
to predict the rest. 
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First we need to introduce some more notation. Let X denote the set of indices of 
the elements of X, so that X = : 1 < i < n, 1 < j < p}. For a subset Id, 

let I denote its complement X \ I. We use the symbol * to denote a missing value, 
and let 1„ = iU {*}. For / C X, we define the matrix Xi G IR" x p with elements 

, Xij if (i,j) G /, 
= < (5-9) 
* otherwise. 



Similarly Xj, has elements 



(5.10) 



* otherwise. 



Finally, for A G M™ xp , we define 

\\ A \\h= E 4- ( 5 - n ) 

(<j)e/ 

This notation allows us to describe matrices with missing entries. 

A Wold-style speckled hold-out is an unstructured random subset I C X. The held- 
in data is the matrix Xj and the held-out data is the matrix X j. We use a missing 
value SVD algorithm to fit a fc-term SVD to Xj. This gives us an approximation 
UkD k V^, where Uk G M nxfc and Vk G W xk have orthonormal columns and D G 
R kxh is diagonal. We evaluate the SVD on the held-out set by \\UkD k Vj — X/||| 7 . 
Aside from the algorithm for getting UkD^V]: from Xj, this is a full description of 
the CV replicate. 

When Wold introduced this form of hold-out in 1978 [95], he suggested using an 
algorithm called nonlinear iterative partial least squares (NIPALS) to fit an SVD 
to the held-in data. This algorithm, attributed to Fisher and Mackenzie [33] and 
rediscovered by Wold and Lyttkens [94], never seems to have gained much prominence. 
We suggest instead using an expectation-maximization (EM) as is consistent with 
current mainstream practice in Statistics. Either way, there are some subtle issues in 
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taking the SVD of a matrix with missing values. We explore these issues further in 
Section 5.3, and present a complete algorithm for estimating the factors UkD k V]:. 

5.2.3 Gabriel hold-outs 

A Gabriel-style hold-out works by transforming the unsupervised learning problem 
into a supervised one. We take a subset of the columns of X and denote them as the 
response columns; the rest are denoted predictor columns. Then we take a subset of 
the rows and consider them as test rows; the rest are train rows. This partitions the 
elements of X into four blocks. With permutation matrices P and Q, we can write 

P^XQ=( Xn Xl2 ). (5.12) 

Here, In consists of the train-predictor block, X 12 is the train-response block, X 2 \ 
is the test-predictor block, and X 22 is the test-response block. It is beneficial to think 
of the blocks as 

Xn Xi 2 \ 
X 2 \ X 22 J 

A Gabriel-style replicate has hold-in 

X L1 
X 2 \ 

and hold-out X 22 . 

To fit a model to the hold-in set, we take the SVD of X n and fit a regression 
function from the predictor columns to the response columns. Normally, the regres- 
sion is ordinary least squares regression from the principal component loadings to the 
response columns. Then, to evaluate the function on the hold-out set, we apply the 
estimated regression function to X 21 to get a prediction X 22 . 

In precise terms, suppose that there are p\ predictor columns, p 2 response columns, 
ni train rows, and n 2 test rows. Then X n G ]R niXpi and X 22 G ]R n2Xp2 withpx-l-^ = P 
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and rti + n 2 = n. First we fit an SVD to the train-predictor block 

X u = V^U 1 D 1 V T l . 

~ ~ ~ T 

Then, we truncate the SVD to k terms as y/n Ui(k)D 1 (k)V 1 (k) in the same way as 
is discussed in Section 5.1. This defines a projection Vi(k) and principal component 
scores 

Zi(fc) = XuV^k) = y/nUiiQD^k). 
Similarly, we can get the principal components scores for the test-predictor block as 

Z 2 (k) = X 21 V 1 (k). 

Next, we model the response columns as linear functions of the principal component 
scores. We fit the model X 12 ~ Z ± (k)B with 

B = (Z 1 (k) T Z 1 (k)) + Z 1 (kfX 12 = ^=D 1 (k) + U 1 (k) T X 12 . 

\I Tl 



Finally, we apply the model the test rows to get a prediction for the test-response 
block 

X 22 = Z 2 (k)B = X 21 (-L V 1 (k)D 1 (k)+ U^A X 12 . (5.13) 



This gives a complete description of the CV replicate. 

Remark 5.1. Typically for Gabriel-style CV, we have two partitions, one for the rows 
and one for the columns. If the rows are partitioned into K sets and the columns are 
partitioned into L sets, then we average the estimated prediction error over all KL 
possible hold-out sets. This corresponds to — ~ K and — m L. In this situation, we 

r r ri2 P2 ' 

say that we are performing (K, L)-fold Gabriel-style cross-validation. 

We can get some intuition for why Gabriel-style replicates work by expressing 
the latent factor decomposition X = y/n UDV T + E in block form. We let P = 
(Pi P-i^j an d Q = {Q\ Q 2 ^j be the block-decompositions of the row and column 
permutations so that X^ = PjXQ,. We define TJ \ = PjU, Vj = QjV, and 
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E i:j = PjEQj so that 



P T XQ = ( j (V^UDV? + 23) (q x Q 2 ) 



U.DVj U.DVn + /E n E 12 
U 2 DVj U 2 DVl) \E 21 E 22i 

Thus, all four blocks have the same low-rank structure. If the noise is small then 
X 22 pa ^U 2 DV 2 and 



X 22 Pa y /^U 2 D(VjV 1 (k)) D{k) + (U^kfUj DV 



If the noise is exactly zero, k = ko, and rank(X22) = rank(X), then Owen and 
Perry [65] show that X 22 = X 22 . For other types of noise, the next chapter proves a 
more general consistency result. 

5.2.4 Rotated cross-validation 

When the underlying signal UDV T is sparse, it's possible that we will miss it in the 
training set. For example, if U = (l o) and V = (l o) , then 

(d x 0\ 





\0 0/ 



UDV 1 



Either the test set will observe this factor, or the train set, but not both. Only the 
set that contains X n will be affected by the factor. 

One way to adjust for sparse factors is to randomly rotate the rows and columns 
of X before performing the cross-validation. We generate P € IR nx ™ and Q E W xp , 
uniformly random orthogonal matrices, by employing Algorithm A.l. Then, we set 
X = PXQ T and perform ordinary cross-validation on X. Regardless of the factor 
structure in X , the signal part of X will be uniformly spread across all elements of 
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the matrix. We call this procedure rotated cross-validation, abbreviated RCV. 

RCV is similar in spirit to the generalized cross-validation (GCV) described in 
Craven & Wahba [20] and Golub et al. [35]. GCV is an orthogonally-invariant way of 
performing leave-one-out cross validation for regression. The difference is that GCV 
rotates to a specific non-random configuration of the data that gives equal weight to 
all observations in the rotated space, while RCV rotates to a random configuration 
that gives equal weight in expectation. 



5.3 Missing- value SVDs 



To perform a Wold-style cross-validation we need to be able to compute the SVD of 
a matrix with missing entries. This is a difficult problem. First of all, the problem 
is not very well defined. If A is a matrix with missing entries, there are potentially 
many different ways of factoring A as an SVD-like product. Often, one attempts to 
force uniqueness by finding the complete matrix A' e M n ' p of minimum rank such 
that A\ - = Aij for all non-missing elements of A. Even then, A' may not be unique. 
Take 



* 



Then 






and 




are all rank-1 matrices that agree with A on its non- missing entries. We might 
discriminate between these by picking the matrix with minimum Frobenius norm. In 
this case, the matrix 

1 * 
* 1 / 

presents an interesting problem. We can either complete it as 



or 
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The first option has rank 1 and Frobenius norm 2. The second option has higher 
rank, 2, but lower Frobenius norm, y/2. Which criterion is more important? It it not 
clear what the "right" SVD of A is. 

We can alleviate the problem by considering a sequence of SVDs rather than a 
single one. For each k = 0,1,2,..., define A' k to the rank-/c matrix of minimum 
Frobenius norm that agrees with A on its non-missing elements. If no such matrix 
exists, let / C X be indices of the non-missing elements of A and define the candidate 

sets 

A k = {A k e R nxp : rank(A fc ) = k} 
C k = {A k E A k : || A - A fc || FjJ = mm \\A - B fc || F>J }. 

-Bfc€-4fc 

Lastly, put 

A' k = argmin||A fc || Fj/ . 

We then define the rank-fc SVD of A to be the equal to the SVD of A' k . 

There are still some problems with these SVDs. First of all, in general there may be 
no relationship between A' k and A' k+l ; the two matrices may be completely different 
and have completely different SVDs. We lose the nesting property of ordinary SVDs, 
where the rank-A; SVD is contained in the rank-(/c + 1) SVD. Secondly, although it 
seems plausible, we do not have any guarantees that A' k is unique. Situations may 
arise where two different rank-/c matrices have the same norms and the same residual 
norms on the non-missing elements of A. Finally, finding A' k is a non-convex problem. 
The function \\A — A k \\pj can have more than one local maximum. We need to be 
aware of these deficiencies. 

Rather than get too deep into the missing-value SVD rabbit-hole, we choose in- 
stead to live with an approximation. We acknowledge that computing the best rank-/c 
approximation as defined above is computationally infeasible for large n and p. In- 
stead of proving theorems about the global optimum, we focus on an algorithm for 
computing a local solution. 



5.3. MISSING-VALUE SVDS 



91 



We use an EM-algorithm to estimate A' k . The inputs to the algorithm are k, a non- 
negative integer rank, and A e M™ X;P , a matrix with missing values. The output is A' k , 
a rank-/c approximation of A. The algorithm proceeds by iteratively estimating the 
missing values of A by the values from the first k terms of the SVD of the completed 
matrix. We give detailed pseudocode for the procedure as Algorithm 5.1. This is 
essentially the same algorithm as the SVDimpute algorithm given in Troyanksaya et 
al. [91], except that we use a different convergence criterion. SVDimpute stops when 
successive estimates of the missing values of A differ by less than "the empirically 
determined threshold of 0.01." We instead stop when relative difference of the residual 
sum of squares (RSS) between the non-missing entries and the rank-A; SVD is small 
(usually 1.0 x 10~ 4 or less). Our reason for using a different convergence rule is 
that the analysis of the EM algorithm in Dempster et al. [22] shows that the RSS 
decreases with each iteration, but makes no assurances about the missing values 
converging. Regardless of which convergence criterion is used, the algorithm is very 
easy to implement. 
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Algorithm 5.1 Rank- A; SVD approximation with missing values 

1. Let I = : A? ^ *}■ 

2. For 1 < j 1 < p let be the mean of the non-missing values in column j of A, 
or if all of the entries in column j are missing. 

3. Define A (0) E R nxp by 

|(0) 



4 W - 



Ay if e /, 

jij otherwise. 



4. Initialize the iteration count iV <— 0. 

5. (M-Step) Let 



i=i 



be the SVD of A (N) and let A'^ N) be the SVD truncated to k terms, so that 



i=i 



6. (E-Step) Define A (JV+1) e R nxp as 



7. Set 



^(iv+i) jAj if el, 
tj l^M? otherwise. 



RSS^ = ||A-A; (iV) ||^. 



If |RSS (A ° - RSS (JV - 1} | is small, declare convergence and output A fc as A k . 
Otherwise, increment N <— N + 1 and go to Step 5. 
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5.4 Simulations 

We performed two sets of simulations to gauge the performance of Gabriel- and Wold- 
style cross-validation. In the first set of simulations, we compare estimated prediction 
error with true prediction error. In the second set, we evaluate the methods' abilities 
to estimate k PE , the optimal rank. 

Each simulation generates a data matrix X as X = y/nUDV T + E. We fix the 
dimensions and number of generating signals as n = 100, p = 50, and k = 6. For 
"weak" factors we set D = D wcak = diag (10, 9, 8, 7, 6, 5) and for "strong" factors, 
we set D = D stIong = y/nD WCSLk . We generate U, V, and E independently of each 
other. To avoid ambiguity in what constitutes "signal" and what constitutes "noise" , 
we ensure that the elements of E are uncorrelated with each other. 

We consider two types of factors. For "Gaussian" factors, we put the elements of 
U distributed iid with U u ~ jV(0, \) and the elements of V iid with V u ~ jV(0, ^), 
also independent of U. For "Sparse" factors we use sparsity parameter s = 10% and 
set 

P{C/ n = 0} = 1 - s, 

p{r/n = --^} = p{r/n = +-^} = f. 

Similarly, we put 

F{V U = 0} = 1 - s, 
p{Vn = --U=P{Vu = + -U * 

The scalings in both cases are chosen so that E,[U T U] = E[V T V] = I ko . Gaussian 
factors are uniformly spread out in the observations and variables, while sparse factors 
are only observable in a small percentage of the matrix entries (about 1%). 

We use three types of noise. For "white" noise, we generate the elements of E 
iid with E u ~ A/"(0, 1). For "heavy" noise, we use iid elements with E u ~ cr~ 1 t u , 
and v — 3. Here t v is a t random variable with v degrees of freedom, and a v = 
y/ — 2) is chosen so that EfE 1 ^] = 1. Heavy noise is so-called because it has a 
heavy tail. Lastly, for "colored" noise, we first generate of, . . . ~ Inverse-x 2 (^i) 
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and rf, . . . ,t£ ~ Inverse-x 2 ^) independently, with ui = u 2 = 3. Then we generate 
the elements of E independently as E^ ~ c Zlvi ' of + Tj), where c ultU2 = 

V / l/(^i-2) + l/(z/ 2 -2). A gam, c Ul ^ U2 is chosen so that EfE 1 ?-] = I. Colored noise 
simulates heteroscedasticity. All three types of noise are plausible for real-world data. 

Obviously, this set of simulations comes with a number of caveats. We are only 
looking at two choices for the signal strengths, one choice of n and p, and a single 
hold-out size. Moreover, this example uses relatively small n and p, potentially too 
small for consistency asymptotics to kick in. Despite these deficiencies, the simula- 
tions still convey substantial information about the behavior of the procedures under 
consideration. 

5.4.1 Prediction error estimation 

Our goal with the first simulation was to get intuition for the behavior of Gabriel- 
and Wold-style cross validation as prediction error estimators. We generated random 
data of the form X = ^nUDV T + E, described above. With X{k) being the SVD 
of X truncated to k terms, the (normalized) true prediction error is given by 

PE(Jfe) = \\yftiUDV T - X(k)\\l + 1. 

Cross-validation gives us an estimate PE(fc) of the prediction error curve. We wanted 
to see how PE(fc) compares to PE(fc). 

Figures 5.1 and 5.2 show the true and estimated prediction error curves from 
(2, 2)-fold Gabriel CV and 5-fold Wold CV, along with their RCV variants. The plots 
only show one set of curves for each factor and noise instance, but other replicates 
showed similar behavior. 

For most of the simulations, the cross-validation estimates of prediction error are 
generally conservative. The only exception is with weak factors and colored noise, 
perhaps because of the ambiguity in what constitutes "signal" and what constitutes 
"noise" in this simulation. This global upward bias agrees with previous studies 
of cross-validation (e.g. [15] and [17]). The RCV versions of the methods generally 
brought down the bias. Some authors have observed a downward bias at the minimizer 
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of PE(fc) for ordinary cross-validation ([16], [88]). This bias does not appear to be 
present here. 

A striking difference between Wold- and Gabriel-style CV is their behavior for k 
greater than fcp E . Gabriel-style CV does a better job at estimating the true PE(fc), 
which is relatively flat. Wold-style CV, on the other hand, increases steeply for k past 
the minimizer. In some situations, the Wold-style behavior is more desirable, but as 
the heavy- noise examples in Figure 5.2 illustrate, the steep increase is not always in 
the right place. Gabriel-style cross-validation is better at conveying ambiguity when 
the underlying dimensionality of the data is unclear. 




Figure 5.1: Cross-validation with strong factors. Estimated prediction error 
curves for Gabriel- and Wold-style cross-validation, both original and rotated (RCV) 
versions, with strong factors in the data. The true prediction error is shown in black, 
the CV curves are red, and the RCV curves are blue. Error bars are computed from 
the CV replicates. Despite their upward bias, the methods do well at estimating 
PE(fc) for k < k^ E . 
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Figure 5.2: Cross-validation with weak factors. Estimated prediction error 
curves for Gabriel- and Wold-style cross-validation, both original and rotated (RCV) 
versions, with weak factors in the data. As in Figure 5.1 the true prediction error is 
shown in black, the CV curves are red, and the RCV curves are blue. Error bars give 
are computed from the CV replicates. The methods have a harder time estimating 
PE(/c) and its minimizer. 
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5.4.2 Rank estimation 

In the next simulation, we see how well cross-validation works at estimating the 
optimal rank, especially as compared to other rank-selection methods. We generate 
data in the same manner as before, and record how far off the minimizer of PE(fc) is 
from the minimizer of PE(fc). 

We compared the four CV-based rank estimation methods with seven other meth- 
ods. They are as follows: 

• AIC: Rao & Edelman's AlC-based estimator [71]. 

• BICi, BIC 2 , and BIC 3 : Bai & Ng's BIC-based estimators [2]. 

• F: Faber & Kowalski's modification of Malinowski's F-test, with a significance 
level of 0.05 [31, 54]. 

• MDL: Wax & Kailaith's estimator based on the minimum description length 
principle [93] . 

• UIP: Kritchman & Nadler's estimator based on Roy's union-intersection prin- 
ciple and their background noise estimator, with a significance level of 0.001 
[50, 72]. 

Kritchman and Nadler [50] give concise descriptions of four of the estimators. The 
other estimators, Bai and Ng's BICs, are defined as the minimizers of 

BlCi(Jfe) = log 1 1 X 
BIC 2 (Jfc) = log 1 1 X 

BIC 3 (Jfe) = log 1 1 X 

where C n . p = min(y / n, y/p). 

Tables 5.1-5.4 summarize the results of 100 replicates. For the strong factors in 
white noise, almost all of the methods correctly estimate the true PE-minimizing 
rank. When the noise is non-white, Wold-style CV seems to be the clear winner. 



-X(k)\\l + k-^log^, (5.14a) 
np n + p 

-X^Wl + k^hgC^ (5.14b) 
np 

-Xml + k 1 ^^, (5.14c) 
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Estimated Rank 



Method -7 -6 -5 -4 -3 -2 -1 +1 + 2 +3 +4 +5 +6 +7 > 7 





White Noise 




















CV-Gabriol 




99 


1 
















RCV-Gabriel 




97 


2 


1 














CV-Wold 




100 


















RCV-Wold 




100 


















AIC 




99 


1 
















BICi 




100 


















BIC 2 




100 


















BIC 3 




100 


















F 




100 


















MDL 




100 


















UIP 




100 




















Colored Noise 




















CV-Gabricl 




32 


42 


18 


5 


2 


1 








RCV-Gabriel 




6 


7 


18 


13 


15 


26 


9 


4 


2 


CV-Wold 




97 


3 
















RCV-Wold 




22 


39 


29 


7 


2 


1 








AIC 














2 


9 


19 


70 


BICi 




14 


26 


31 


20 


4 


4 


1 






BIC 2 




23 


41 


24 


9 


1 


2 








BIC3 






1 


1 


3 


4 


5 


3 


4 


79 


F 






4 


11 


29 


27 


15 


12 


1 


1 


MDL 




13 


24 


36 


20 


3 


3 


1 






UIP 






1 


5 


14 


28 


23 


19 


9 


1 




Heavy Noise 




















CV-Gabricl 




61 


35 


4 














RCV-Gabriel 




42 


27 


12 


14 


5 










CV-Wold 




99 


1 
















RCV-Wold 




68 


26 


5 


1 












AIC 




3 


20 


22 


32 


18 


4 


1 






BICi 




65 


27 


7 


1 












BIC 2 




70 


26 


3 


1 












BIC3 




32 


27 


20 


13 


4 


4 








F 




38 


29 


27 


6 












MDL 




64 


28 


7 


1 












UIP 




18 


35 


27 


18 


2 











Table 5.1: Rank estimation with strong Gaussian factors. Difference be- 
tween the estimated rank and the true minimizer of PE(fc) for 100 replicates of strong 
Gaussian factors with various types of noise. 

However, as Figure 5.1 demonstrates, there is not much Frobenius loss penalty for 
slightly overestimating the rank. Therefore, Tables 5.1 and 5.2 may be exaggerating 
the advantage of Wold-style CV. 

For the weak factors in white noise, the AIC, F and UIP methods fare well, and 
the performance of the cross-validation-based methods is mediocre. For non-white 
noise and weak factors, none of the methods perform very well. This is probably due 
to the inherent ambiguity between what constitutes "signal" and what constitutes 
"noise" in these simulations. 
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Estimated Rank 

Method -7 -6 -5 -4 -3 -2 -1 +1 +2 +3 +4 +5 +6 +7 > 7 



White Noise 

CV-Gabriol 8 73 11 7 1 



RCV-Gabriel 98 2 

CV-Wold 1 8 91 

RCV-Wold 1 99 

AIC 100 

BICi 1 99 

BIC 2 1 99 

BIC 3 100 

F 100 

MDL 100 

UIP 100 



Colored Noise 



CV-Gabriel 








4 


34 


19 


24 


8 


6 


5 








RCV-Gabriel 










1 


5 


13 


13 


22 


20 


10 


8 


8 


CV-Wold 




1 


1 


8 


83 


4 


3 














RCV-Wold 










26 


32 


25 


12 


2 


2 






1 


AIC 




















2 


10 


19 


69 


BICi 










17 


23 


29 


20 


3 


4 


2 




2 


BIC 2 










24 


36 


27 


9 


2 


1 


1 






BIC3 














3 


2 


2 


4 


7 


2 


80 


F 












4 


9 


32 


28 


14 


11 




2 


MDL 










16 


24 


31 


19 


6 


2 


1 




1 


UIP 














4 


17 


27 


23 


18 


9 


2 




Heavy Noise 


























CV-Gabriol 








5 


52 


29 


11 


2 


1 










RCV-Gabriel 










39 


24 


19 


11 


6 


1 








CV-Wold 




1 


1 


11 


83 


4 
















RCV-Wold 










66 


28 


5 


1 












AIC 










2 


21 


24 


33 


11 


6 


3 






BICi 








1 


63 


27 


6 


3 












BIC 2 








1 


71 


22 


5 


1 












BIC3 










28 


31 


19 


14 


7 


1 








F 










32 


41 


14 


12 


1 










MDL 










63 


28 


6 


3 












UIP 










20 


32 


27 


15 


6 











Table 5.2: Rank estimation with strong sparse factors. Difference between 
the estimated rank and the true minimizer of PE(fc) for 100 replicates of strong sparse 
factors with various types of noise. 
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Estimated Rank 

Method -7 -6 -6 -4 -3 -2 -1 +1 +2 +3 +4 +5 +6 +7 > 7 



White Noise 



CV-Gabriel 










3 


56 


32 


9 














RCV-Gabriel 










5 


66 


20 


8 


1 












CV-Wold 








4 


31 


65 


















RCV-Wold 








3 


32 


65 


















AIC 










3 


95 


2 
















BICi 






1 


13 


46 


40 


















BIC 2 




1 


8 


29 


47 


15 


















BIC 3 












99 


1 
















F 












99 


1 
















MDL 








6 


44 


50 


















UIP 












99 


1 


















Colored Noise 




























CV-Gabriel 










2 


9 


22 


21 


25 


10 


3 


5 


2 


1 


RCV-Gabriel 












2 


4 


11 


16 


14 


12 


19 


9 


13 


CV-Wold 


14 4 


7 


9 


17 


14 


34 


4 








2 




1 


1 


RCV-Wold 










2 


29 


24 


17 


8 


6 


7 


2 


4 


1 


AIC 






















4 


14 


14 


68 


BICi 










2 


26 


22 


20 


9 


8 


3 


2 


4 


4 


BIC 2 










4 


39 


28 


12 


5 


4 


1 


3 


3 


1 


BIC3 














1 


1 


5 


11 


1 


3 


4 


74 


F 












1 


7 


16 


24 


17 


13 


10 


1 


11 


MDL 










1 


25 


23 


21 


8 


9 


4 


1 


4 


4 


UIP 














2 


11 


12 


22 


19 


15 


4 


15 




Heavy Noise 




























CV-Gabriel 




1 




1 


2 


10 


51 


25 


8 


1 


1 








RCV-Gabriel 












13 


37 


24 


13 


9 


2 


1 




1 


CV-Wold 


14 6 


4 


13 


17 


21 


32 


1 












1 




RCV-Wold 






1 


2 


13 


59 


16 


4 


2 


1 


1 






1 


AIC 












7 


15 


28 


25 


14 


7 


1 


2 


1 


BICi 






1 


2 


16 


58 


16 


3 


2 




1 






1 


BIC 2 






4 


15 


23 


46 


8 


1 


1 


1 




1 






BIC3 












38 


21 


23 


8 


4 


4 


1 




1 


F 












45 


22 


19 


11 


1 


1 






1 


MDL 








2 


14 


61 


15 


3 


3 




1 






1 


UIP 












27 


29 


27 


9 


4 


3 






1 



Table 5.3: Rank estimation with weak Gaussian factors. Difference be- 
tween the estimated rank and the true minimizer of PE(k) for 100 replicates of weak 
Gaussian factors with various types of noise. 
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Estimated Rank 



Method 
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77 


16 
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21 


45 


29 
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14 
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86 


11 
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85 


15 
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19 


42 


37 
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73 


26 
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CV-Gabriel 








2 


5 


16 


19 


19 


12 


9 


3 
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3 
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RCV-Gabriel 
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2 


7 


12 


15 


19 


11 


8 


25 


CV-Wold 


1 3 6 


11 


10 


17 


21 


25 


2 


2 






1 






1 


RCV-Wold 










2 


22 


21 


19 


12 


11 


3 


5 


3 
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2 


8 


9 


81 


BICi 










1 


22 


23 


20 


10 


9 


4 


2 


3 


6 


BIC 2 






1 




7 


37 


17 


13 


9 


7 


1 


3 


4 


1 


BIC3 
















2 


2 


2 


4 


8 


7 


75 


F 














4 


6 


21 


22 


14 


13 


6 


14 


MDL 












22 


19 


22 


12 


10 


5 


2 


3 


5 


UIP 
















5 


9 


22 


19 


17 


10 


18 




Heavy Noise 




























CV-Gabriel 
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4 


22 


25 


24 


20 


3 










RCV-Gabriel 
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4 


9 


18 


20 


18 


12 


13 


2 


1 


2 


CV-Wold 
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10 


16 


19 


29 


18 




1 










1 




RCV-Wold 








2 


19 


51 


7 


16 


3 


1 






1 




AIC 












2 


17 


16 


19 


22 


8 


4 


10 


2 


BICi 






1 


4 


21 


51 




13 


3 


1 






1 




BIC 2 




2 


4 


16 


32 


31 


4 


9 




1 






1 




BIC 3 










3 


25 


23 


17 


12 


10 


4 


4 


1 


1 


F 










2 


29 


21 


23 


10 


7 


5 


2 




1 


MDL 








3 


19 


52 


6 


15 


3 


1 






1 




UIP 












20 


23 


22 


13 


12 


3 


5 


1 


1 



Table 5.4: Rank estimation with weak sparse factors. Difference between 
the estimated rank and the true minimizer of PE(fc) for 100 replicates of weak sparse 
factors with various types of noise. 
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5.5 Real data example 

We conclude this chapter with a neuroscience application. The Neural Prosthetic Sys- 
tems Laboratory at Stanford University (NPSL) is interested in studying the motor 
cortex region of the brain. Essentially, they want to know what the correspondence is 
between neural activity in that part of the brain and motor activity (movement). Re- 
search is still at a very fundamental level, and the basic question of how many things 
are being represented is still unanswered. It is thought that desired position, speed, 
and velocity get expressed as neural activity, but conjectures about the dimensionality 
of the neural responses vary from 7 to 20 or more. 

NPSL has designed and carried out an experiment meant to measure the di- 
mensionality of neural response for a two-dimensional motion task. The experiment 
involves measuring the activity in 49 neurons as a monkey performs 27 different 
movement tasks (conditions). 

For a particular neuron and condition, a simplified explanation of the experiment 
is as follows: 

1. At time t = ms, start recording neural activity in the monkey. 

2. At time t = 400ms (target-on), show the monkey a target. The monkey is 
not allowed to move at this point. 

3. At a random time between time t = 400 ms and time t = 1560 ms, allow the 
monkey to move. 

4. At time t = 1560 ms (movement), the monkey starts to move and point at the 
target. 

5. Record activity up to but not including time t = 2110 ms. 

Each condition includes a target position and a configuration of obstacles. The same 
monkey is used for every trial. Measurements are taken at 5 ms intervals, so that 
there are 422 total time points. It is necessary to do some registration, scaling, 
and interpolation before doing more serious data analysis, but the details of those 
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processes are not important for our purposes. Figure 5.3 shows the preprocessed 
responses for each neuron and condition. 
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Figure 5.3: Motor cortex data. Response rates in 47 neurons for 27 movement 
tasks. The subplots show the normalized response rates in a single neuron as functions 
of time. Each color corresponds to a different movement task. The plot on the left is 
a zoomed-in view of the data for the first neuron. 



The data from the NPSL motor cortex experiment can be put into a matrix 
where each neuron is thought of as a variable, and the timepoints of each condition 
are thought of as observations. This gives us a matrix X with p = 49 variables and 
n = 27 ■ 422 = 11394 observations. Of course, the rows of X are nothing like iid, and 
the noise in X is not white. Parametric methods are not likely to give very reliable 
estimates of the dimensionality of X , but cross-validation stands a reasonable chance. 

After centering the columns of X, we performed Wold- and Gabriel-style cross- 
validation to estimate the dimensionality of the signal part of X. For Gabriel-style 
CV, we first tried both (2, 2)-fold and (2,49)-fold; both resulting PE(k) curves had 
their minima at the maximum k. For 5-fold Wold-style CV, there is a minimum at 
k = 13. The BIC, F, MDL, and UIP estimators all chose k = 48 as the dimensionality, 
while the AIC estimator chose k = 47. We show the cross-validation estimated 
prediction curves in Figure 5.4. It is likely that the true dimensionality of X is high. 
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Figure 5.4: Motor cortex estimated prediction error. Prediction error 
as a function of rank, estimated by three cross-validation methods. The units are 
normalized so that the maximum prediction error is 1.0. Error bars show one standard 
error, estimated from the folds. The prediction error estimate from Wold-style CV 
shows a minimum at k = 13, but the Gabriel-style CV estimates always decrease with 
k. It is likely that the true dimensionality of the data is high. 
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5.6 Summary and future work 

We have described two different forms of cross-validation appropriate for model se- 
lection in unsupervised learning. Wold-style CV uses a "speckled" leave-out, and 
Gabriel-style CV uses a "blocked" leave-out. We have defined two forms of error 
associated with SVD/PCA-like models, the prediction error and the model error. 
Through simulations, we have shown that both forms of CV can be considered to give 
estimates of prediction error. Both methods perform well, but Wold-style CV seems 
to be more robust to badly-behaved noise. We have applied these cross-validation 
methods to a data analysis problem from a neuroscience experiment. 

We have focused on latent factor models and the singular value decomposition. 
However, it is relatively easy to translate the two styles of cross-validation presented 
here to other unsupervised learning methods. For Wold-style hold-outs, an EM- 
like algorithm can usually be applied to the models in many unsupervised learning 
contexts. The Gabriel-style philosophy of "treat some of the variables as response and 
the others as predictors" can also be applied more broadly. We are in the process of 
investigating both cross-validation strategies for clustering and kernel-based manifold 
learning. 

This chapter leaves a number of open questions. Mainly, we have not provided 
any theoretical results here, only simulations. A quote from Downton's discussion of 
Stone's 1974 paper on cross-validation equally applies to our work: 

A current nine-day wonder in the press concerns the exploits of a Mr. Uri 
Geller who appears to be able to bend metal objects without touching 
them; [The author] seems to be attempting to bend statistics without 
touching them. My attitude to both of these phenomena is one of open- 
minded scepticism; I do not believe in either of these prestigious activities, 
on the other hand they both deserve serious scientific examination [86]. 

Despite Dowton's skepticism, cross-validation has proven to be an invaluable tool 
for supervised learning. It is our hope that with some additional work, CV can be 
just as valuable for unsupervised learning. In the next chapter, we provide some 
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theoretical justification for Gabriel-style cross-validation, but analysis of Wold-style 
cross-validation is still an open problem. 



Chapter 6 



A theoretical analysis of 
bi-cross- validation 

In this chapter we will determine the optimal leave out-size for Gabriel-style cross- 
validation of an SVD, also known as bi-cross- validation (BCV), along with proving 
a weak form of consistency. In Chapter 4, we rigorously defined the rank estimation 
problem, and in Chapter 5 we introduced Gabriel-style cross validation. Here, we 
provide theoretic justification for Gabriel-style CV. 

First, a quick review of the problem. We are given X, an n x p matrix generating 
by a "signal-plus-noise" process, 

X = V^UDV T + E. 

Here, U € R nxk <->, D e R fe ° xfc °, V € M pxfe °, and E G R nxp . The first term, 
y/nUDV T , is the low-rank "signal" part. We call U and V the matrices of left 
and right factors, respectively. They are normalized so that U T U = V T V = 
Jfc . The factor "strengths" are given in D, a diagonal matrix of the form D = 
diag(c?i, g?2, • • • , dk ), with d\ > d 2 > • • • > dk > 0. Also, typically k is much smaller 
than n and p. Lastly, E consists of "noise" . Although more general types of noise are 
possible, for simplicity we will assume that E is independent of U, D, and V. We 
think of the signal part as the important part of X, and the noise part is inherently 
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uninteresting. 

The rank estimation problem is find the optimal number of terms of the SVD 

_ - ~ ~ T 

to keep to estimate the signal part. We let X = ^JnUDV be the SVD of 



X, where TJ E j^nxnAp an( j y ^ ^pxnAp h ave orthonormal columns, and D = 
diag(rfi, d 2 , . . . , d nAp ) with d\ > d 2 > • • • > d n/sp . For < k < n A p, we define 
D(k) = diag(di,d 2 ,...,4,0,0,...,0) so that X(k) = ^/nUD(k)V T is the SVD of 
X truncated to k terms. The model error with respect to Frobenius loss is given by 

ME(k) = —\\y/nUDV' T - X(k)\\l 
np 

The optimal rank is defined with respect to this criterion is 

k* = argminME(/c). 

k 

The problem we consider is how to estimate ME(fc) or k*. 

Closely related to model error is the prediction error. For prediction error, we con- 
jure up a noise matrix E' with the same distribution as E and let X' = y/nUDV T + 
E'. The prediction error is defined as 

PE(k) = -^-E\\X' - X{k)\\ 2 F , 

which can be expressed as 

PE(fc) = E[ME(fc)l + — E\\E\\l. 

np 

The minimizer of PE is the same as the minimizer of E[ME(fc)], and one can get an 
estimate of ME from an estimate of PE by subtracting an estimate of the noise level. 

The previous chapter suggests using Gabriel-style cross-validation for estimating 
the optimal rank. Owen & Perry [65] call this procedure bi-cross- validation (BCV). 
For fold of BCV, we permute the rows of X with matrices and then 
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partition the result into four blocks as 

p«T X Q(i) 



-X^ll -X"l2 
X21 X22, 



We take the SVD of the upper-left block and evaluate its predictive performance on 
the lower-right block. If X u = ^UiD{V T is the SVD of X u and X n (k) = 
y/nUiDi(k)V is its truncation to k terms (with Di(k) defined analogously to 
D(k)), then the BCV estimate of prediction error from this fold is given by 

PE(k;i,j) = — \\X 22 - X 21 X 11 (k) + X 12 \\ 2 F . 

n 2 p 2 

Here, + denotes pseudo-inverse and X 22 has dimensions n 2 xp 2 . For (K, L)-fold BCV, 
the final estimate is the average over all folds: 

i=l j=l 

From PE(A;) we can get an estimate of the optimal rank as k = argmin fe PE(/c). 

In this chapter, we give a theoretical analysis of PE(A;). This allows us to de- 
termine the bias inherent in PE(/c) and its consistency properties for estimating k*, 
along with guidance for choosing the number of folds (K and L). Section 6.1 sets 
out our assumptions and notation. Section 6.2 gives our main results. Then, Sec- 
tions 6.3 and 6.4 are devoted to proofs, followed by a discussion in Section 6.5. 



6.1 Assumptions and notation 

The theory becomes easier if we work in an asymptotic framework. For that, we 
introduce a sequence of data matrices indexed by n: 



X n = yfiiU n D n Vl + E. 



(6.1) 
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Here, X n G W ixp with p = p(n) and ^ — > 7 G (0, 00). Even though the dimensions 
of X n grow, we assume that the number of factors is fixed at k . The first set of 
assumptions is as follows: 

Assumption 6.1. We have a sequence of random matrices X n G R nxp with n — > 00 
and p = a/so oomo to infinity. Their ratio converges to a fixed constant 7 G 

(0,oo) 05 J =7+0 (£). 

Assumption 6.2. T/ie matrix X n is generated as X n = y/nU n D ' n V^ + E n . Here, 
U n G R nxk °, D n G R k " xk ", V n G R pxk °, and E n G R nxp . The number of factors, k , 
is fixed. 

Assumption 6.3. The matrices of left and right factors, U n and V n , have or- 
thonormal columns, i.e. U^U n = Vj t V n = Jfc . Their columns are denoted by 
Un,i,U n ,2, ■ ■ -lUnto and v nil ,u n>2 , ■ ■ -,V nM , respectively. 

Assumption 6.4. The matrix of factor strengths is diagonal: 

D n = diag(d n> i, 4,2, • • • , d nM ). (6.2) 

The strengths converge as d\ { °—i jii and d\ { — fa = Op > strictly ordered as 
Hi> H2> ■■■> Hho > 0. 

Assumption 6.5. The noise matrix E n is independent of U n , D n , and V n . Its 

elements are iid with E nX1 ~ A/"(0, a 2 ). 

These assumptions are standard for latent factor models. 

We can apply the work of Chapter 4 to get the behavior of the model error. We 
let X n = y/nU n D n V n be the SVD of X n and let X n {k) = y/nU n D n (k)V n be its 
truncation to k terms. Then the model error is 

ME„(A;) = —\\yfcU n D n Vl - X n {k)\\\. (6.3) 

With Assumtions 6.1-6.5, we can apply Proposition 4.6 to get that for fixed k as 
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k ko / -J \ 2 

p ■ ME n (k) °-4' ]T + ]T /i, + a 2 ( 1 + — J • (k - k ) + , (6.4) 

i=l i=fc+l ^ V '/ 

where 

/^(3^ 2 + (7 + lH 

= { 2 ( , \ 2 ( 6 - 5 ) 

[l + Ml + ^J otherwise. 
Defining A;* as the minimizer of ME n (/c), we also get that 

k* n °^ max {i : ^ > /i crit } , (6.6) 

where 

provided no //j is exactly eqaul to /i cr it- We therefore know how ME n (A;) and its 
minimizer behave. 

To study the bi-cross-validation estimate of prediction error, we need to introduce 
some more assumptions and notation. As we are only analyzing first-order behavior, 
we can restrict our analysis to the prediction error estimate from a single fold. We 
let P n G R nxn and Q n G IR pxp be permutation matrices for the fold, partitioned as 
P n = (P„,i P„, 2 ) and Q n = (Q n l Q n>2 ) , with P n>1 G R nxn \ P n , 2 G R nxn2 , 
Q nA G lR pxpi , and Q n 2 G R pxp2 . Note that n = n x + n 2 and that p = p 1 + p 2 . We 
define X n>ij = P^XQ^, E n , 3 = P^EQ^, U n>i = P^E/ n , and V nJ = Ql^V n . 
Then in block form, 



-Xn,ll -Xn,12 

yX n2 l X n22/ 



P T n X n Q n = 

*-n,21 -&n,2l) 

- ^fn ( Un ' lDnVn ' 1 U n,i D nV n>2 \ + ( E ri:11 E n , 12 \ ^ 

\U n ,2D n Vn A U nj2 D n V^ 2 ) \ E n,2i E nt22 ) 

This is the starting point of our analysis. 
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~ T 

Now we look at the estimate of prediction error. We let = ^/n U n ^D n ^Y n l 

be the SVD of X n , u - Here, 

U„,i=(u£l u% ••• mS^aw), (6.9a) 

Vn,i = (£} £1 ■■■ £l lApi ), (6.9b) 

and 

D nA = diag (d^, dfy, . . . , d ( ^ niApi ^j . (6.9c) 

For convenience, we define /t^j = {d^lY ■ For < k < n\ Api, we let 

D n>1 (k) = diag (dg, , . . . , , 0, 0, . . . , O) (6.10) 



so that X ni n(/c) = y/nU n ^D njl (k)V n 1 is the SVD of X nill truncated to k terms. 
This matrix has pseudo-inverse X nt u(k) + = ^V nt iD nt i(k) + U nl . Therefore, the 
BCV rank-A; prediction of X 2 2 is 

X ri: 22(k) = X n: 2lX nll {k)X nt i2 

= (^U n , 2 D n V^ A + E n>21 ) (X n<11 (k)+) (^iU 1hl D n Vl 2 + E n>12 ) 
+ U n , 2 D n Vl 1 V n , 1 D n , 1 (k) + ul 1 E n , 1 2 



T 



+ E n , 21 V n , 1 D n)1 (k) + Ul 1 U n , 1 D n Vl2 

+ — E nt21 V nil D n>1 (k) + ul A E n>12 
v ^ 

= V / ^C/n, 2 £)n©n,lA t ,l(A;) + $J i r>nVj 2 

+ U n , 2 D n @ nA D nA (k) + E nA 2 
+ E n , 21 D n ^k) + <bl l D n Vl 2 

+ - i r E n , 21 D 1hl (k) + E n , 12 , 
v ^ 



(6.H) 
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where n ,i = V^V^i, # n ,i = Ul fin^ E nA2 = U n>1 E n>12 , and £„, 2 i = E rh21 V nA . 
Note that i2 n ,i2 and -E„ i2 i have iid jV(0, a 2 ) entries, also independent of the other 
terms that make up X rh22 (k) and X„ j22 . By conditioning on E n:12 and E n ^i, we can 
see that X 22(h) is in general a biased estimate of y/nU nt2 D n V^ 2 . 

To analyze X 22(h), we need to impose some additional assumptions. The first 
assumption is fairly banal and involves the leave-out sizes. 

Assumption 6.6. There exist fixed K,L e (0, 00) (not necessarily integers), such 
that ^ = K + o (±) and^ = L+o (±) . 

ri2 \ vn J P2 \ V n J 

The next assumption is not quite so innocent and involves the distribution of the 
factors. 

Assumption 6.7. The departure from orthogonality for the held-in factors U n> i and 
Vn,i is of order Specifically, 

supE 



sup E 



/ rp m \ z 

nil U nl U nA I ko ) < 00, and 



2 

< 00. 

F 



This assumption is there so that we can apply the theory in Chapter 3 to get at the 
behavior of the SVD of X n li . It is satisfied, for example, if we are performing ro- 
tated cross-validation (see subsection 5.2.4) or if the factors are generated by certain 
stationary processes. Proposition A. 5 in Appendix A is helpful for verifying Assump- 
tion 6.7. It is likely that the theory presented below holds under a weaker condition, 
but a detailed analysis is beyond the scope of this chapter. 

6.2 Main results 

The BCV estimate of prediction error from a single replicate is given by 

PE„(*0 = — \\X r ,22 ~ X n , 22 (h)\\l. (6.12) 
n 2 p 2 
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It turns out that E PE n (fc) is dominated by the irreducible error. However, if we 
have a y^-consistent estimate of a 2 , then we can get an expression for the scaled 
limit of the estimated model error. This expression is the main result of the chapter. 

Theorem 6.8. Suppose that a 2 is a sequence of ^frvp- consistent estimators of a 2 
satisfying E [y/np(a 2 — a 2 )] — > 0. Define the BCV estimate of model error from a 
single replicate as 

ME n (k) = PE n (k) - a 2 n . 



Then, for fixed k as n — > oo ; 

E \p ■ ME n (k) 



kAko ko 

"£2f3i^i+ Yl Vi + V-(k- ko)+, 

i=l i=k+l 



(6.13) 



(6.14) 



where 



^ + ( 7 ^w>0 

T 2 , n 4 when ^ > 



a 2 

fpl' 



1+ ^ 

P-i 



otherwise, 



(6.15a) 



K- 1 L-l 

~K L~ 



(6.15b) 



and 



V = 



K 
K-l 



L-l 
L 



2 ' 



(6.15c) 



It is interesting to compare /3j with the expression for ctj in equation (6.5), which 
appears in the scaled limit of the true model error, p ■ ME n (fc). Although the BCV 
estimator of model error is biased, this bias is small for large fa, K, and L. 

A corollary gives the behavior of the minimizer of the expected model error esti- 



mate. We let k„ be the rank that minimizes E 



ME n (A;) 



Note that k„ is a deter- 



ministic quantity. The next two results follow from Theorem 6.8. 
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Corollary 6.9. As n — > oo ; 



o- 2 [2' 

k n max ^ i : /ij > — — • \ - } , (6.16) 



>/7 V P) ' 



(provided no /ij zs exactly equal to ^= ■ y^, m which case the limit is ambiguous). 

We can use Corollary 6.9 to guide our choice of and L. If we choose them 
carefully, then k n and A;* will converge to the same value. 

Corollary 6.10. // 



where 



y/2 + 7 -i/2- 2 



7 = " ^ , (6-18) 



then k n and k* converge to the same value (provided no Hi is exactly equal to ^L- y |)- 

Interestingly, the first-order-optimal choices of K and L do not depend on the 
aspect ratio of the original matrix. All that matters is the ratio of the number of 
elements in X n ll to the number of elements in X n . 

For a square matrix, 7 = 7 = 1, and the optimal p is |. If we choose K = L, then 
this requires 



so that 

K 

For general aspect ratios (77^ 1), this requires 

K 3 




3 - 2 (VT+3 - vt) 



For very large or very small aspect ratios (7 — > or 7 — > 00), K 1. In these 
situations one should leave out almost all of the matrix when performing bi-cross- 
validation. 
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The remainder of the chapter is devoted to proving Theorem 6.8. 



6.3 The SVD of the held-in block 

The first step in analyzing the BCV estimate of prediction error is to see how the SVD 
of X n:11 behaves. With Assumptions 6.1-6.7, we can start this analysis. Our strategy 
is to use Assumption 6.7 to apply a matrix perturbation argument in combination 
with Theorems 3.4 and 3.5. 

We show that we can apply Theorem 3.5 to ^/nU ny \D n V^ ll + E nj u even though 
U n> i and V n> i do not have orthogonal columns. First we set 

K-l L 

7i = — y-7 (6.19) 

and note that r>± — ni.p_.n_ _^ ( i \ _ Next, for 1 < i < k , we define 

L — 1 . 
= —j (6.20a) 



K-l 



(fjH,i + a 2 ) (l + when /i M > 



/',., < ^ "_"/ \2 71 " M/ ' (6-20b) 




• a 2 (l + -j= j otherwise. 



6 lti ={V L VV wUJ \ ™W ri " Vti' ( 6 20c) 

otherwise, 



, w H ' v¥ • /F^iKsr — > (6 . 20d) 

otherwise. 



For % > h , we put fx lti = ^j^-(? 2 (l + -j=*j ■ Now, for k > 1 we let 0i(fc) and $i(/c) 
be k x matrices with entries 

otherwise, I otherwise, 
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respectively. In this section, we prove the following: 

Proposition 6.11. For fixed k as n — > oo ; the first k columns of ®i >n and $ ln 
converge in probability to ®i(k) and respectively. Likewise, for 1 < i < k, 

a n,i ^ Pl.i ' 

We prove the proposition by leveraging the work of Chapter 3. We have that 

X 

n,n = \/nU nj iD n V nl + E rhU . The first term does not satisfy the conditions 
of Theorems 3.4 and 3.5 since U nj i and V n ,\ do not have orthonormal columns. 
Moreover, the scaling is y/n instead of \Jn{. We introduce scaling constants and 
group the terms as 



With this scaling, 
E 



(6.22) 



^) T (^-)]-[(^-) T (^0 



and the diagonal elements of (^^J^-D n ^j converge to /i 1 /^ ■ ■ ■ > /A^k - P ro P° _ 
sition 6.11 should at least be plausible. 

We prove the result by showing that (^j~^U n ^j ^y^T-D^ {^J^V n ^ is almost 

~ T 

an SVD. We denote its /co-term SVD by U n ^D n ^{V n l) and demonstrate the following: 
Lemma 6.12. Three properties hold: 

(1) For 1 < i < k , \fdl, - dl,\ = O p . 

(2) For any sequence of vectors xi, X2, ■ ■ ■ ,x n with x n G M. n , 



\ I U n \ J x n U n l x n 



\3£n\\2 



(3) For any sequence of vectors y 1; y 2 , . . . , y n with y n e M. p , 




V n A y n -V nl y r , 



= o. 



n 
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Above, d Uji is the ith diagonal entry of D n , which are assumed to be sorted in de- 
scending order. 

Proposition 6.11 is then a direct consequence of Lemma 6.12 combined with Theo- 
rems 3.4 and 3.5. 



Proof of Lemma 6.12. First, let y ^-E/ n ,i = U n ^R n be a Q-R-decomposition so that 
U H:1 E W rllXk ° has orthonormal columns and R n is an upper-triangular matrix. Define 
R n ,i = y/n(Rn 7 i - Ifc ) so that R n = I ko + -j^R n ,i- We can write 

'W rj-i _\_ / _____ ' | ' \ ___ m T 

— U n ,l U n,l = Ik + —j= { R n,l + #„,l) + --R„,l-Rn,l- 



By Assumption 6.7, %Ul tl U n ,i - h = O v . Therefore, R nA = O v (1). 

The same argument applies to show that there exits a V n ,\ £ ]R PlXfe ° with or- 
thonormal columns such that 

Vn,l = Vn,l I Ik "I 7=Sn,l I , 



Pi V v n 

and S n ^i is upper-triangular with S ni i = Cp(l). 
We now look at 



1 T 1 T 

Since the diagonal elements of -D ra are distinct, we can apply Lemma 2.15 twice to 
get that there exist k x k matrices R n ,i, S n i and A n of size Op(l) such that 



with Jfe + ^Rn t i and Jfc + ^<S n ,i both orthogonal matrices, and A n diagonal. 
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We define U n>1 = U n>1 ^I ko + ^=-Rn,i), 



( J fco + and 



V n>1 = V nA (l feo + 
'j[D n + ^Aj. Now, as promised, U ntl D n V^ A is the SVD of 

:^-)(^»)(^-) T 

We can see immediately the property (1) holds. For property (2), note that 

U ny i = U n ,l (jk + —T^Rn,l) 



U n i (lk H l=R n ,l) (lk H l=Rn,l 

rii \ \/n J V x/n 




for some R ny \ = Op(l). Therefore, 



~ T 



1 ~T 



so that 



U n iX n U nl x n — ^—R nl U nl x n 

— T T 1 ~ , 

l^n.l^n — ^"n.l^nlh < ~ /= ||-Rti,i||f ' 11^71,1 ||f ' \\ x n\\2 



A similar argument applies to show that property (3) holds. 



6.4 The prediction error estimate 

In this section, we study the estimate of prediction error 

PE n (fc) = — ||X ni22 -X n)22 (A;)||! 
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We can expand this as 
1 



PEn(*0 = tr ((X n , 22 - X n , 22 (A;)) (X n>22 - X„, 22 (A;)) T ) 



n 2 P2 
1 

n 2 p 2 



— tr ((^U nt2 D n Vl 2 + E n)22 - X n , 22 (A;)) 

• (v^t/ n , 2 D n Vj 2 + £ n , 22 - X n , 22 (A;)) T ) 



1 



n 2 p 2 



\\V^U n , 2 D n Vl 2 -X nt22 (k)\\ 2 F 
+ 2 tr (E n , 22 {yfaU n , 2 D n Vl 2 - X n , 22 (k)f) (6.23) 

+ 1 1 -E7?t.,22 1 1 F 



It has expectation 



E 



PE n (&) 



= E 



n 2 p 2 



\\^U n , 2 D n Vl 2 -X n , 22 {k)\\ 



+ a 2 . 



(6.24) 



The first term is the expected model approximation error and the second term is the 
irreducible error. 

The expected model approximation error expands into four terms. We have 



E\\V^U n , 2 D n Vl 2 -X n , 22 (k)\\l 



= E 



tr ((v^C/ n , 2 (I} n - D n ® n ^D rhl (k) + ^ n ^D n )V T nt2 

- U n)2 D n @ n>1 D nyl (k) + E nA2 

-E n>21 A, a (fc) + *£iAiVj 2 

i=E n 2 iD n i(k) + E n i 2 ) 

■ (\fnU rh2 (D n - D n <d rhl D rhl (k) + $> nA D n )VZ^ 

- U n , 2 D n @ nA D nA (k) + E n , 12 

- ^ B>21 6 B>1 (fc) + *J 1 D B Vj 2 

- -j=E nt21 D nA (k) + E n!l2 ) T ^) 



(6.25) 



6.4. THE PREDICTION ERROR ESTIMATE 



121 



By first conditioning on everything but E njl2 and E n ^i, the cross-terms cancel and 
we get 

E\\V^U n , 2 D n Vl 2 - X n , 22 (k)\\l 

= E\\V^U n , 2 (D n - D n ® n)1 i> n)1 {k) + <S> n)1 D n )Vl 2 \\l 
+ E\\U nt2 D n @ njl D n ^k) + E n)12 \\l 
+ E\\E n , 21 D n , 1 (k) + ®l 1 D n Vl 2 \\ 2 F 

+ E||^£: ni21 A i ,i(A;) + ^„,i2|| F - (6.26) 

Since E n ^ 2 and E Tl:21 are made of iid A/"(0, a 2 ) random variables, the last three terms 
are fairly easy to analyze. We can use the following lemma: 



Lemma 6.13. Let Z e ]R mx ™ be a random matrix with uncorrelated elements, all 
having mean and variance 1 (but not necessarily from the same distribution). If 
A G M. nxp is independent of Z , then 



E\\ZA\\ F = m- E\\A\\ F . 



Proof. The square of the ij element of the product is given by 

n 2 



a=l 

n 



This has expectation 



E 



(za)ji=x;ek] > 



a=l 
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so that 



m p 

E\\ZAf F ^J2J2 E [( ZA )l 

i=i j=i 

rn p n 



= EEE I K1 

i=l j=l a=l 

= m-E||A||p. □ 

We also need a technical result to ensure that after appropriate scaling, the ex- 
pectations are finite. 

Lemma 6.14. For fixed k as n — > oo ; ^ s a ^ mos t surely bounded away from zero. 

Proof. We have that d n k is the fcth singular value of = U n ^D n V n l + 

^E Ujll . This is the same as the kih eigenvalue of 

-X nj iiX^ n = U ntl D n Vl A V n^DnUn! + —=U n)X D n V^E^n 

I b \f I b 



+ —f= E n,llVn,lD n U nl + -E nM E nU . 

\f lb lb 

For each n, we choose O n = (o nt i O n pj & ]R nxn to be an orthogonal matrix with 
O n> 2 G M. nxn ~ k o anc l 0^ 2 U n ,i = 0. Then, with X k (-) denoting the fcth eigenvalue, we 
have 



= A fc (ol^-X nA iXj hU ^O n 



At 



for An.ij = JO^X^nX^nOnj. Note that A„ i2 2 = ±0^ 2 E nill El A1 O n , 2 is an (n - 
k) x (n — k) matrix with iid A/"(0, a 2 ) entries. 

Define G n = Afe(A n 2 2) By the eigenvalue interlacing inequality [36], we have that 
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d ( ni > G n- Moreover, Theorems 2.17 and 2.20 give us that G n ^ a 2 (l + -j=f ■ 
Hence, almost surely for n large enough \ > G n > a 2 . □ 



With these lemmas, we can prove the following: 



Lemma 6.15. The terms in the expected model approximation error converge as: 



E 



P 



n 2 P2 



^U n , 2 (D n - D n e n ,iD n ,i(A;) + * n) iD n )Vj. 



fcAfco 



k/\ko / , \ z fco 

^ ( 1 - J -^-Oiwij ) + 

i=i V V^M ' l=k+l 

U n> 2Dn®n,lD n) i(k) + E n> i2 



E 



E 



V 



n 2 p2 
V 



n 2 P2 



2 fcAfco 

7 i=1 



fcAfco 



2 \^ P« ,5 



and 



E 



P 



™2 P2 



D n Ak) + E n .12 

n 



<7 

7 



E 



<7 



Proof. The squared Frobenius norm in the first term is equal to 

tr ([^U n , 2 {D n - D n & rhl D nA (k) + <S> ri:1 D n )Vl 2 ) 

■ (V^U rh2 {D n - D n e nA D nA {k) + ® nA D n )Vl 2 ) T 

= n ■ tr ((D n - D n e nil A t ,i(^) + *n,i£)n) • (vj 2 y„, 2 ) 

• (£> n - D n e n)1 D ni i(*0 + *n,iA,) • (Ul 2 U n , 2 )) 



(6.27a) 

(6.27b) 
(6.27c) 



(6.27d) 
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N ° W > U l,2 U n,2 ^ ]<Iko, Vl,2 V n,2 ^ \l ko , and 

D n - D n @ n<1 D ntl (k) + <S> 1hl D n 

P ,. ( 1/2 1/2 Q _-l/2/ M 1/2 1/2 l/2 Q _-l/2/ ; x 1/2 

-»• diag (^/V - 01,1^1,1' (fc) (Pl,lfj.i , - ^2 01,2/^1,2 (K) ^1,2/^2 , 

1/2 1/2/) _-l/2/ M l/2\ 

where 

A*i,i if i < k, 
otherwise. 

We can apply the Bounded Convergence Theorem to get the result for the first 
term since the elements of U n> 2, V n ,2, @n,i and $ Tti i are bounded by 1 and since 
Lemma 6.14 ensures that the elements of D 1hl (k) + are bounded as well. 

The last three terms can be gotten similarly by applying Lemmas 6.13 and 6.14. 

□ 




We can now get an expression for the limit of the estimated model approximation 
error. Specifically, 



E 



P 



n 2 P2 



y/nU n , 2 D n Vl 2 - X nfl 2{k) 



A a 2 A a 2 

+ z2 * + - ■ 22 — 



i=k+l 
ko 



i=k +l 



kAko ko a 2 f 1 \ 2 

^2 Pi Hi + ^2 ft H ( 1 + ~/= ) '( k ~ fc o) + ' 

i=l i=k+l ^ ^ v^i/ 



(6.28) 
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where 

A = ( i - x [^0imX + — (r'ol + A) + 1 — 

= 1 - + ''' <>h A, + f (7"% + + (6-29) 

V A*i,i A*i,j A*i,i 7A*i,i 

If A*i,i < then M = </? M = and /i M = • a 2 (l + ^= ) , so that 

ft =i + ^.(.4)-. 

7 Pi V v 7 ^/ 
In the opposite situation (/ii^ > -^=), we define 

L-l K - 1 

and get the simplifications 

^ = P-(i + (i + 7r')- + ^), 

Pi V Pi,i 7iPi/ 

0i,i¥>i,i = p- w — • ( i — °— J , 

VPM V 7iPi,J 

so that 

o / P* i Pi /)2 2 

y Mi,! /*i 

= V 4Y (i - -4-) f i + 2(1 + -,-)- + 3^V) . 

VPW V 7iPi,i/ V Pi,i 7iPi,i/ 

- (7"^ , + ^ J = P 2 f f 1 - Ar V(l + 7: 1 ) — + 2^V) , 
Pi,* V ' ' ^ VPm/ V 7iP 2 ,J V Pm 7iP 2 ,J 

and 

Pi7Pi,i VPM/ V7iPm/ V Pi,i 7iA*i,</ 
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Putting it all together, we get 



= l + 



2^r 

71 Ml, 





a 4 


(- 




+( 


' a 4 




a 4 


-) 


7i/i 2 ,, 


- 1 





a 2 a 4 



J I V 7i/'L7 V ' ' 7i^ 2 ,i 

l + (l+7r 1 )— + 

- , , 2 4 



i+P 2 ^ h^-i i+a+Tr 1 )— + , 



i + (i + 7 r 1 )^ + ^ 



3^v + (i + 7r 1 )- 

7lMi,i V a 7 Ml 

i + (i + 7r x 



Ml,; 7lM? » 



7iM 



- (3a 2 + (71 + 1)// M ) 



i + (i + 7 r 1 )^ + ^ r - 



(6.30) 



T 2 



In particular, note that $ < 1 when — 1 < 0, or equivalently /ij > ^ • y 2 ;. 

Getting the final expressions for $ and 77 in Theorem 6.8 is a matter of routine 
algebra. 



6.5 Summary and future work 

We have provided an analysis of the first-order behavior of bi-cross-validation. This 
analysis has shown that BCV gives a biased estimate of prediction error, with an 
explicit expression for the bias. Fortunately, the bias is not too bad when the signal 
strength is large and the leave-out sizes are small. Importantly, our analysis gives 
guidance as to how the leave-out sizes should be chosen. Our theoretical analysis 
agrees with the simulations done by Owen & Perry [65], who observed that despite 
bias in prediction error estimates, larger hold-out sizes tend to perform better at 
estimating k F . 

The form of consistency we give is rather weak since we did not analyze the 
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variance of the BCV prediction error estimate. As a follow-up, it would be worthwhile 
to address this limitation. For now, though, we can get some comfort from empirical 
observations in [65] that the variance of the estimator is not too large. Indeed, based 
on these simulations, it is entirely possible that the variance becomes negligible for 
large n and p. 
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Appendix A 

Properties of random projections 



We use this section to present some results about random projection matrices. We 
call a symmetric p x p matrix P a projection if its eigenvalues are in the set {0, 1}. 
Any projection matrix of rank k < p can be decomposed as P = VV T for some V 
satisfying V T V = I k . We call the set 

V fc (W p ) = {V e W xk : V T V = I k } C W xk (A.l) 

the rank-k Stiefel manifold ofW. It is the set of orthonormal /c-frames in IR P . Simi- 
larly, we call the set 

g k (W) = {VV r : V e V k (R p )} C W xp (A.2) 

the rank-k Grassmannian of W; this is the set of rank k projection matrices. If 
(v V^j is an p x p Haar-distributed orthogonal matrix and V is p x k, then we say 
that V is uniformly distributed over V k (R p ) and that VV T is uniformly distributed 
over G k (W). 

A.l Uniformly distributed orthonormal k- frames 

We first present some results about a matrix V distributed uniformly over Vfc(M p ). 
We denote this distribution by V ~ Unif (Vfc(IR p )) . In the special case of k — 1, the 
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distribution is equivalent to drawing a random vector uniformly from the unit sphere 
in IR P . We denote this distribution by V ~ Unif (<S P_1 ). 

A. 1.1 Generating random elements 

The easiest way to generate a random element of Vfc(M p ) is to let Z be a p x k matrix 
of iid J\f(0, 1) random variables, take the QR decomposition Z = QR, and let V be 
equal to the first k columns of Q. In practice, there is a bias in the way standard 
QR implementations choose the signs of the columns of Q. To get around this, we 
recommend using Algorithm A.l, below. 

Algorithm A.l Generate a random ortho normal /c-frame 

1. Draw Z, a random px k matrix whose elements are iid A/"(0, 1) random variables. 

2. Compute Z = QR, the QR- decomposition of Z. Set Q Y to be the px k matrix 
containing the first k columns of Q. 

3. Draw S, a random k x k diagonal matrix with iid diagonal entries such that 
P{S n = -l} = P{S n = +l} = |. 

4. Return V = Q^. 



This algorithm has a time complexity of 0(pk 2 ). Diaconis and Shahshahani [23] 
present an alternative approach called the subgroup algorithm which can be used to 
generate V as a product of k Householder reflections. Their algorithm has time com- 
plexity O (pk). Mezzadri [60] gives a simple description of the subgroup algorithm. 

A. 1.2 Mixed moments 

Since we can flip the sign of any row or column of V and not change its distribution, 
the mixed moments of the elements of V vanish unless the number of elements from 
any row or column is even (counting multiplicity). For example, E [V^V^i] = since 
we can flip the sign of the second row of V to get 



v^v 21 = V*(-V 21 ) = 



-v*y 21 . 
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This argument does not apply to V11V12V22V21 or V^V 2 \. 

In the special case when k — 1, the vector (V^, V 2 \, . . . , V^) is distributed as 
Dirichlet (|, |, . . . , |) . This follows from the fact that if Y 1: Y 2 , . . . , Y p are indepen- 
dently distributed Gamma random variables and Yi has shape Oj and scale s, then 
with S = Y7i=i Yi-, the vector ^^S-, ^, . . . , ^ j is distributed Dirichlet (ai, 02, ... , a p ). 
Using the standard formulas for Dirichlet variances and covariances, we get the mixed 
moments up to fourth order. They are summarized in the next lemma. 



Lemma A.l. IfV ~ Unif (Vi(M p )), then 



E [ Vl \] 


1 

P 


(A.3a) 


E[V n V 21 ] 


= 0, 


(A.3b) 


E [V*] 


3 

p(p + 2)' 


(A.3c) 


E [V*V*] 


1 

p(p + 2)" 


(A.3d) 



T/ie odd mixed moments are all equal to zero. 



Using Theorem 4 from Diaconis and Shahshahani [24], which gives the moments 
of the traces of Haar-distributed orthogonal matrices, we can compute the mixed 
moments of V for more general k. Meckes [59] gives an alternative derivation of 
these results. 



Lemma A. 2. IfV ~ Unif (Vfc(IR p )) and k > 1, then the nonzero mixed moments of 
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the elements V up to fourth order are defined by 



E [V?J = i, (A.4a) 

E WW = ^M - < A - 4d > 

E ' v » v » VMV "' = P(P -r ) 1 ( P+ 2) - (A - 4e) 



Proof. The first three equations follow directly from the previous lemma. We can 
get the other moments from the moments of O, a Haar-distibuted p x p orthogonal 
matrix. For the fourth equation, we use that 

E[tr(0)] 4 = ]T E[O rr O ss O tt O uu }. 

r,s,t,u 

Only the terms with even powers of On are nonzero. Thus, we have 

E [tr(O)] 4 = Qe [Oy + Q Qe [Ol l0 l 2 ] 
= pE[0 11 ]+3p(p-l)E[0 2 u 2 22 ]. 

Theorem 4 of Diaconis and Shahshahani [24] gives that E [tr(O)] 4 = 3. Combined 
with Lemma A.l, we get that 

E [Ol l0 l 2 ] = {E [tr(O)] 4 - P E [Of,] } 

1 L 3 

6 — p 



Sp(p — 1) \ 1) 

p + 1 

p(p-l)(p + 2)" 
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For the last equation, we use E [tr(0 )] . We have that 

(0 4 )jj = Oj r O rs O s tOty 
r,s,t 

We would like to compute E [(0 4 ) n ] . Note that unless r = s — t — 1, there are only 
three situations when E [Oi r O rs O st O t i] ^ 0. Two of them are demonstrated visually 
by the configurations 



1 s 

1 f0 lr ••• Or* "A 



si 



r = 1, s = t, s ^ 1 



and 



O 



lr 



t — l,r — s,r ^ 1 



The other nonzero term is when s — 1, r — t, and r ^ 1, so that Oi r O rs O st Oa = 
0\ r 2 rX . In all other configurations there is a row or a column that only contains 
one of {Oi r , O rs , O st , O n }. Since we can multiply a row or a column of O by — 1 
and not change the distribution of O, for this choice of r, s, and t we must have 
0\ r O rs O st Ot\ = —0\ r O rs O st Ot\- This in turn implies that E [Oi r O rs O s tOti\ = 0. 
With these combinatorics in mind, we have that 

E [(° 4 )n] =J2 E [On0 ls O ss O sl ] +J2^[OirO rr O rl 11 ] + J^E [0 2 lr 2 rl ] + E [0 4 u ] 

s^l r^l r^l 

= 2(p - 1)E [O n 12 22 21 ] + (p - 1)E + E [0*J 

= 2(p - 1)E [0 11 0i 2 0220 21 ] + (p - 1)E [0^0 2 2 2 ] + E [O^] 



Again applying Theorem 4 of [24], we have that E [tr(0 4 )] = I. Combined with 
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Lemma A.l, we have 

1 f 1 



E 



[O n 12 22 21 ] = ^—^ j-E [tr(C> 4 )] - E [0*J - (p - 1)E [C^O*,] 



1 fl 3 p+1 

- (p-1): 



2(p-l)ip p(p + 2) v " 'p(p-l)(p + 2) 
-1 



p(p-l)(p + 2)- 



□ 



A. 2 Uniformly distributed projections 

When P G IR pxp is chosen uniformly over the set of rank-k p x p projection matrices, 
we say P ~ Unif (£ fc (IR p )) . With the results of the previous section, we can derive the 
moments of random projection matrices. 

Lemma A.3. If P ~ Unif (G k (R p )) , then 

E [P] = k -I v . (A.5) 

Proof. Write P = VV T , where V ~ Unif (V fc (M p )) . For 1 < i, j < p we have 

k 

r=l 

SO 



E[P 



/cE^], wheni=j, 
/cE [V11V21] , otherwise. 
The result now follows from Lemma A.l. □ 

Lemma A.4. Let P ~ Unif (£ fc (M p )) . If 1 < i,j,i',f < p, then 

^^-JTW*>(i)H) ( A,) 



A.2. UNIFORMLY DISTRIB UTED PROJECTIONS 



135 



This gives us that aside from the obvious symmetry (P^ = Pji), the off- diagonal 
elements of P are uncorrelated with each other and with the diagonal elements. 



Proof. We need to perform six computations. As before, we use the representation 
P = VV T , where V ~ Unif (V fe (M p )) . We have 



which gives us that 



Next, 



ft 

(X 

1=1 



v, 



1-4 



1 E^+E^5 

1=1 l^J 

/,• • , 3 ^ + k (k - 1) 



P (p + 2) 
A; (A; + 2) 

P (p + 2) 

2 / _ k\ ( k 

P + 2 VP/ V P/ VP 



Var [P n ] = -^o (") ( 1 " " 
P + 2 VP/ V P 



n (p + 2) 



so that 



E [P 2 ,] = E 



i=i 



E 
fc 



i=l 



1 



+ fc(fc- 1) 



1 



P (P + 2) 

p /A; 



(p-l)(p + 2) VP 
Var [P 12 ] = 



p(p-l)(p + 2) 

Af 



P 



P 



.-) (i-- 

(P- !)(p + 2) VP/ V P 
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Also, 



E [P11P22] = E 



E^)(E^ 



i=l 



J'=l 
k 



E 



E^ + EE^5 

i=l j^i 



i=l 



+ jfe(ib- 1) 



P+ 1 



p(p + 2) p(p-l)(p + 2) 

/ . /c 1 

l + (p+l) 



p(p + 2) 
-2 



p — 1 



fc\ / k 
(p- l)(p + 2) Vp/ V P 



so that 



Cov [P n ,P 22 ] 



-2 



(p-l)(p + 2) VP 



1 - 



7; 



Since P is symmetric, we have 



Cov[P 12 ,P 21 ] = Var[P 12 ] 



The other covariances are all zero. This is because 



E [PnPi 2 ] = E 
E [Pi 2 P 23 ] = E 



1,3 



and 



E [Pi 2 P 34 ] = E 



Each term in these sums has an element that appears only once in a row. Thus, the 
expectations are all 0. □ 
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A. 3 Applications 

We now present two applications of the results in this section. 



A. 3.1 Projections of orthonormal A;- frames 

Suppose we have U G W xk , an orthonormal fc-frame, and we randomly project the 
columns of U into W, with q < p. If we denote the projection matrix by V r and set 
U = V T U, it is natural to ask how close U is to being an orthonormal /c-frame. We 
can prove the following: 

Proposition A.5. Suppose U G W xk satisfies U^U = I k . Let V ~ Unif (V 9 (M P )) 
with k < q < p and set U = \fp~j 'qV T U '. Then there exists a decomposition U = 
Uq + such that U Uq = I k and 

EH^II^^^ + l)^) 2 ^-^)- (A.7) 
In particular, this implies that E\\Ui\\ 2 F < ^ k (k + 1). 

~ T - 

The main ingredients of the proof are a perturbation lemma and a result about U U, 
stated below. 



Lemma A.6. Suppose U G W pxk satisfies U T U = I k . Let V ~ Unif (V g (M p )) with 
k<q<p and set U = ^/p/qV T U. Then E E/ T L/] = I k and 



E 



^(u T u-i k )\\ F <k(k + i) (J 



1 - 



V 



(A.8a) 



with a matching lower bound of 



E 



v /g(E/ T E/-/ fe )|[>A;(A; + l) (l 



1 - 



p 



p J \p + 2 



(A.8b) 
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Proof. Set P = VV T so that tfjJ = v - U T PU. Now, 



E 



~ T ~ ' 

u u 



= - C7 T E [P] U = I k . 



Also, since O t PO = P for any p x p orthogonal matrix O, we must that U T PU 
has the same distribution as the upper k x k submatrix of P. This implies that 



E 



• T ~ 



u U-I k 



k k 

i=i j=i 



/it. + fc(fc-i) 

p k (k + 1) ^ q\ ( q 



V 



(p-l)(p + 2) VP 
The lower and upper bounds follow. 



p 



(p-l)(p + 2) 
2 



p(k + I) J ' 



□ 



The next ingredient is a perturbation theorem due to Mirsky, which we take from 
Stewart [84] and state as a lemma. 

Lemma A. 7 (Mirsky). If A and A + E are in R nxp , then 



nAp 



J2HA + E)-a t (A)) 2 <\\E\\% 



(A.9) 



where crj(-) denotes the ith singular value. 

We can now proceed to the rest of the proof. 

~ T ~ 

Proof of Proposition A. 5. We have that U U = + E, where E||^/5£7||| < C and 
C is given in Lemma A. 6. We can apply Lemma A. 7 to get 



i=i 



Setting = <Tj(C7 C7) — 1, we have <Ji(U) = y/1 + el. Note that Efg^ef] < 
EH^/g^Hp < C. Let R and 5bepxfc and k x k matrices with orthonormal columns 
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such that 

U = R{I k + A)5 T 

is the SVD of U, where A = diag(Ai, A 2 , . . . , A fc ), and A; = y/T+el - 1. Set 
C7o = and t/i = ^/q RAS T . Then, 

C/q t/o = SR T RS T = SS T = I k 

and 



i=l 



where we have used that \/l + e~i > 1 + \ei — \e\. □ 



A. 3. 2 A probabilistic interpretation of the Frobenius norm 

As another application of the results in this section, we give a probabilistic represen- 
tation of the Frobenius norm of a matrix. It is commonly known that for any n x p 
matrix A, 

sup {x T Ay) 2 = \\A\\l (A.10) 

IM| 2 =i, 

yi2=i 

where || • || 2 is the spectral norm, equal largest singular value (see, e.g. [36]). The 
function f(x,y) = x T Ay is a general bilinear form on W 1 x W p . The square of the 
spectral norm of A gives the maximum value of (f(x,y)) 2 when x and y are both 
unit vectors. It turns out that the Frobenius norm of A is related to the average 
value of (f(X, Y)) when X and Y are random unit vectors. 

Proposition A.8. If A e R nxp , then 



I 



(x T Ay) 2 dxdy = — \\A\\ 2 F . (A.ll) 

- np 



\m\2=i, 
\\yh =1 



Proof. There are two steps to the proof. First, we show that if X ~ Unif(5 n x ) and 
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a is arbitrary, then 



E[X T a] 2 = 1|1 - 112 



a 



2' 



n 

Next, we show that if Y ~ Unif(5 p " 1 ), then 

E||Ay||l = ^\\A\\l. 

The result follows from these two facts. To see the first part, since X is orthogonally 
invariant we have 

X T Q = X T (||fl|| 2 ei) = ||fi||2^i. 
To see the second part, let A = UT,V T be the SVD of A and write 

nAp 

\\AY\\l = \\^Y\\l ± \\m\l = J2^(A) Yl □ 



Appendix B 



Limit theorems for weighted sums 



In this appendix we state and prove some limit theorems for weighted sums of iid 
random variables. First, we give a weak law of large numbers (WLLN): 

Proposition B.l (Weighted WLLN). Let X n l , X n2 , • • • , X„ in be a triangular ar- 
ray of random variables, iid across each row, with EX n> i = \x and EX^ l uniformly 
bounded in n. Also, let W n> i, W n> 2, ■ ■ ■ , W n>n be another triangular array of ran- 
dom variables independent of the X n ^ (but not necessarily of each other). Define 



W n = ^ Y17=i W n ,i- If W n —> W and E YH=i ^n,i] ^ s uniformly bounded in n, then 



We do not give a proof of Proposition B.l, but instead derive a strong law of large 
numbers (SLLN) below. The proof of the weak law is similar. Here is the strong law: 

Proposition B.2 (Weighted SLLN). Let X n>1 , X n>2 , ■ ■ ■ , X n>n be a triangular array of 
random variables, iid across each row, with EX n> i = fi and EX^ 1 uniformly bounded 
in n. Also, let W n> i, W n> 2, ■ ■ ■ , W n ^ n be another triangular array of random variables 
independent of the X n ^ (but not necessarily of each other). Define W n = ^ Y^i=i W n> i. 
If W n W and E Yli=i ^n,i] ^ s uniformly bounded in n, then 




(B.l) 



i=l 




(B.2) 



i=l 
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Proof. Define S n = £™=i W n ,X n , and let = a(W n>1 , W n>2 , W n>n ). We have 
that 

1 - 1 n 

-S n - W n /I = - V W n>i (X„ ti - fj) 



i=l 



Note that 

4 ( E - E + 3 E K^n.i - ^) 2 (^ - ^) 2 ] E 



E 



1=1 



<Ci 



n 
i=l 



for some constant C\ bounding E [X„ 5 i — /i] 4 and 3E [(X nj i — /i) 2 (X n>2 — /i) 2 ]. There- 
fore, the full expectation is bounded by some other constant C 2 . We have just shown 
that 



E 



1 



n 



Sn-Wn/J, 



< 



c 



TV- 



for some constant C. Applying Chebyschev's inequality, we get 



n 



-S n -W n fi 



>e < 



C 
rite*' 



Invoking the first Borel-Cantelli Lemma, see the sum converges almost surely. □ 

Next, we derive a central limit theorem (CLT). To prove it we will need a CLT 
for dependent variables, which we take from McLeish [58]: 

Lemma B.3. Let X n j, J 7 ^, i = l,...,n be a martingale difference array. If the 
Lindeberg condition J^" =1 E [X^; \X n ^\ > e\ — > is satisfied and YH=i-^ni ~* °" 2 
then Zti X n,i^^(0, o 2 )- 

With this Lemma, we can prove a CLT for weighted sums. 

Proposition B.4 (Weighted CLT). Let X n ^% = l,...,n be a triangular array of 
random vectors in W p with X n ,i iid such that EX n l = fi x , Cov[X n ,i] = S x 7 and all 
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mixed fourth moments of the elements of X n ,i are uniformly bounded in n. Let W n ,i, 
i = 1, . . . , n be another triangular array of random vectors in W, independent of the 
X n ,i but not necessarily of each other. Assume that ^ Yli=i ^n,iWn,i ~* ^ W , and 
that for all sets of indices ji,j2, jz, Ja, with each index between 1 andp, we have that 
E [n XXi W n,ijiW ntij2 W ntij3 W ntij4 ] is uniformly bounded in n. Then 



n 1 n 

-Y d w n fX n , i --Y,w, 

i=i i=i 



A/"(0, E W .E X ), (B.3) 



where • denotes Hadamard (elementwise) product. 



Proof. Let 



1 n 1 n 



x 



i=l i=l 

We will use the Cramer- Wold device. Let G 1R P be arbitrary. Define 



1 



so that ^S 1 ™ = Z)r=i^«,»- Also > witn ^ = a ( Y n,i,Yn,2, ■ ■ ■ ,^n,i-i), the collection 
{F nj j,jF n j} is a martingale difference array. We will use Lemma B.3 to prove the 
result. First, we compute the variance as 



n 1 n 

Yl Y n,i = ~ 
i=l i=l 



^ n n p 

= — ^ ^ ^ dj9 k W nji jWn t ik(X nti j — fxf)(X nt i k — fj,*) 
U i=l j=l k=l 

3=1 k=l 

= 9 T (S w . S x ) 6, 



where we have used the fourth moment assumptions and Proposition B.l to get the 
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i=i 



convergence. Lastly we check the Lindeberg condition. We have that 

n 

E E K/. \Y,J>e] 

£^E E Kd 

i=l 
1 n 

^2^2 ^2 { (>i l> J' l> J- l> i: • E 1 1 1 1 »•'./> 1 1 '«•'./.; 1 1 '«•'./;* 



i=l ii,-J4 



Ci 



< 

e 2 n 



< 



E E 

I,- 



n,-,j4 



1 " 

i=l 



U4 



e 2 n 



where Ci is a constant bounding |^| 4 and the centered fourth moments of X n>i j, and 
C 2 bounds the fourth moments of the weights. □ 

For some classes of weights, we can get a stronger result. 

Corollary B.5. With the same assumptions as in Proposition B.J,., if the mean of 
the weights converges sufficiently fast as *Jn Y17=i Wn,i ~ l^ W ] 0? then 



1 n 

~ J"] Wn,i • x n , t - n 



M{0, E^.E X ) 



(B.4) 
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